Our end goal for this project is to clean the data so that we could then create a machine learning model. We want to see if we are able to predict a person's gender based purely on their candy preferences. Although, you will not be creating a model for this assignment, only cleaning the data. The results of the models that I used after cleaning the data are provided at the end of this notebook.
# initial imports
import pandas as pd
import numpy as np
# Do not change this option; This allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', 20)
Let's start by importing our data and creating a DataFrame called candy
. We need to include encoding='iso-8859-1'
during the import because there are special characters in the data that Pandas doesn't recognize. This happens a lot when attempting to import data where the public is able to input answers, especially if there are foreign language characters included. The normal encoding for Pandas is utf-8
, so changing the encoding allows Pandas to recognize those special characters.
Run the following code, with the encoding argument, and it should import correctly.
# read_csv with iso-8859-1 encoding; using latin-1 would also work here
candy_full = pd.read_csv('candy.csv', encoding='iso-8859-1')
# copy to new DF so that we can have a copy of the original import if needed
candy = candy_full.copy()
candy
Internal ID | Q1: GOING OUT? | Q2: GENDER | Q3: AGE | Q4: COUNTRY | Q5: STATE, PROVINCE, COUNTY, ETC | Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | ... | Q8: DESPAIR OTHER | Q9: OTHER COMMENTS | Q10: DRESS | Unnamed: 113 | Q11: DAY | Q12: MEDIA [Daily Dish] | Q12: MEDIA [Science] | Q12: MEDIA [ESPN] | Q12: MEDIA [Yahoo] | Click Coordinates (x, y) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 90258773 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 90272821 | No | Male | 44 | USA | NM | MEH | DESPAIR | JOY | MEH | ... | NaN | Bottom line is Twix is really the only candy w... | White and gold | NaN | Sunday | NaN | 1.0 | NaN | NaN | (84, 25) |
2 | 90272829 | NaN | Male | 49 | USA | Virginia | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 90272840 | No | Male | 40 | us | or | MEH | DESPAIR | JOY | MEH | ... | NaN | Raisins can go to hell | White and gold | NaN | Sunday | NaN | 1.0 | NaN | NaN | (75, 23) |
4 | 90272841 | No | Male | 23 | usa | exton pa | JOY | DESPAIR | JOY | DESPAIR | ... | NaN | NaN | White and gold | NaN | Friday | NaN | 1.0 | NaN | NaN | (70, 10) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2474 | 90314359 | No | Male | 24 | USA | MD | JOY | DESPAIR | MEH | DESPAIR | ... | Fruit Stripe Gum | NaN | White and gold | NaN | Friday | NaN | NaN | NaN | NaN | NaN |
2475 | 90314580 | No | Female | 33 | USA | New York | MEH | DESPAIR | JOY | NaN | ... | Capers | NaN | Blue and black | NaN | Friday | NaN | 1.0 | NaN | NaN | (70, 26) |
2476 | 90314634 | No | Female | 26 | USA | Tennessee | MEH | DESPAIR | JOY | DESPAIR | ... | NaN | NaN | Blue and black | NaN | Friday | NaN | 1.0 | NaN | NaN | (67, 35) |
2477 | 90314658 | No | Male | 58 | Usa | North Carolina | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2478 | 90314802 | No | Female | 66 | usa | Pennsylvania | DESPAIR | DESPAIR | JOY | DESPAIR | ... | NaN | You hit all my chocolate highlights, and broug... | White and gold | NaN | Sunday | 1.0 | NaN | NaN | NaN | (19, 26) |
2479 rows × 120 columns
Let's take a brief look at the data by using head()
.
candy.shape
(2479, 120)
# first five rows
candy.head()
Internal ID | Q1: GOING OUT? | Q2: GENDER | Q3: AGE | Q4: COUNTRY | Q5: STATE, PROVINCE, COUNTY, ETC | Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | ... | Q8: DESPAIR OTHER | Q9: OTHER COMMENTS | Q10: DRESS | Unnamed: 113 | Q11: DAY | Q12: MEDIA [Daily Dish] | Q12: MEDIA [Science] | Q12: MEDIA [ESPN] | Q12: MEDIA [Yahoo] | Click Coordinates (x, y) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 90258773 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 90272821 | No | Male | 44 | USA | NM | MEH | DESPAIR | JOY | MEH | ... | NaN | Bottom line is Twix is really the only candy w... | White and gold | NaN | Sunday | NaN | 1.0 | NaN | NaN | (84, 25) |
2 | 90272829 | NaN | Male | 49 | USA | Virginia | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 90272840 | No | Male | 40 | us | or | MEH | DESPAIR | JOY | MEH | ... | NaN | Raisins can go to hell | White and gold | NaN | Sunday | NaN | 1.0 | NaN | NaN | (75, 23) |
4 | 90272841 | No | Male | 23 | usa | exton pa | JOY | DESPAIR | JOY | DESPAIR | ... | NaN | NaN | White and gold | NaN | Friday | NaN | 1.0 | NaN | NaN | (70, 10) |
5 rows × 120 columns
Next, run the following code to see information about the DataFrame.
# check info about the DataFrame
candy.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2479 entries, 0 to 2478 Columns: 120 entries, Internal ID to Click Coordinates (x, y) dtypes: float64(4), int64(1), object(115) memory usage: 2.3+ MB
Notice that this did not print the columns as you might be used to seeing. According to the Pandas documentation: "If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used."
We can make the columns display by setting the max_cols
argument equal to the number of columns in the data set.
candy.shape
(2479, 120)
# check info, set max_cols
candy.info(max_cols=120)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2479 entries, 0 to 2478 Data columns (total 120 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Internal ID 2479 non-null int64 1 Q1: GOING OUT? 2368 non-null object 2 Q2: GENDER 2437 non-null object 3 Q3: AGE 2394 non-null object 4 Q4: COUNTRY 2414 non-null object 5 Q5: STATE, PROVINCE, COUNTY, ETC 2377 non-null object 6 Q6 | 100 Grand Bar 1728 non-null object 7 Q6 | Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes) 1741 non-null object 8 Q6 | Any full-sized candy bar 1803 non-null object 9 Q6 | Black Jacks 1517 non-null object 10 Q6 | Bonkers (the candy) 1482 non-null object 11 Q6 | Bonkers (the board game) 1469 non-null object 12 Q6 | Bottle Caps 1709 non-null object 13 Q6 | Box'o'Raisins 1787 non-null object 14 Q6 | Broken glow stick 1767 non-null object 15 Q6 | Butterfinger 1793 non-null object 16 Q6 | Cadbury Creme Eggs 1792 non-null object 17 Q6 | Candy Corn 1796 non-null object 18 Q6 | Candy that is clearly just the stuff given out for free at restaurants 1784 non-null object 19 Q6 | Caramellos 1723 non-null object 20 Q6 | Cash, or other forms of legal tender 1795 non-null object 21 Q6 | Chardonnay 1732 non-null object 22 Q6 | Chick-o-Sticks (we donÕt know what that is) 1528 non-null object 23 Q6 | Chiclets 1764 non-null object 24 Q6 | Coffee Crisp 1621 non-null object 25 Q6 | Creepy Religious comics/Chick Tracts 1771 non-null object 26 Q6 | Dental paraphenalia 1783 non-null object 27 Q6 | Dots 1746 non-null object 28 Q6 | Dove Bars 1773 non-null object 29 Q6 | Fuzzy Peaches 1652 non-null object 30 Q6 | Generic Brand Acetaminophen 1744 non-null object 31 Q6 | Glow sticks 1778 non-null object 32 Q6 | Goo Goo Clusters 1595 non-null object 33 Q6 | Good N' Plenty 1741 non-null object 34 Q6 | Gum from baseball cards 1759 non-null object 35 Q6 | Gummy Bears straight up 1778 non-null object 36 Q6 | Hard Candy 1780 non-null object 37 Q6 | Healthy Fruit 1781 non-null object 38 Q6 | Heath Bar 1763 non-null object 39 Q6 | Hershey's Dark Chocolate 1802 non-null object 40 Q6 | HersheyÕs Milk Chocolate 1803 non-null object 41 Q6 | Hershey's Kisses 1797 non-null object 42 Q6 | Hugs (actual physical hugs) 1762 non-null object 43 Q6 | Jolly Rancher (bad flavor) 1781 non-null object 44 Q6 | Jolly Ranchers (good flavor) 1780 non-null object 45 Q6 | JoyJoy (Mit Iodine!) 1449 non-null object 46 Q6 | Junior Mints 1777 non-null object 47 Q6 | Senior Mints 1532 non-null object 48 Q6 | Kale smoothie 1731 non-null object 49 Q6 | Kinder Happy Hippo 1528 non-null object 50 Q6 | Kit Kat 1801 non-null object 51 Q6 | LaffyTaffy 1739 non-null object 52 Q6 | LemonHeads 1745 non-null object 53 Q6 | Licorice (not black) 1789 non-null object 54 Q6 | Licorice (yes black) 1790 non-null object 55 Q6 | Lindt Truffle 1757 non-null object 56 Q6 | Lollipops 1784 non-null object 57 Q6 | Mars 1750 non-null object 58 Q6 | Maynards 1450 non-null object 59 Q6 | Mike and Ike 1746 non-null object 60 Q6 | Milk Duds 1782 non-null object 61 Q6 | Milky Way 1787 non-null object 62 Q6 | Regular M&Ms 1800 non-null object 63 Q6 | Peanut M&MÕs 1804 non-null object 64 Q6 | Blue M&M's 1748 non-null object 65 Q6 | Red M&M's 1746 non-null object 66 Q6 | Green Party M&M's 1711 non-null object 67 Q6 | Independent M&M's 1662 non-null object 68 Q6 | Abstained from M&M'ing. 1532 non-null object 69 Q6 | Minibags of chips 1751 non-null object 70 Q6 | Mint Kisses 1699 non-null object 71 Q6 | Mint Juleps 1664 non-null object 72 Q6 | Mr. Goodbar 1735 non-null object 73 Q6 | Necco Wafers 1731 non-null object 74 Q6 | Nerds 1752 non-null object 75 Q6 | Nestle Crunch 1777 non-null object 76 Q6 | Now'n'Laters 1657 non-null object 77 Q6 | Peeps 1765 non-null object 78 Q6 | Pencils 1766 non-null object 79 Q6 | Pixy Stix 1753 non-null object 80 Q6 | Real Housewives of Orange County Season 9 Blue-Ray 1722 non-null object 81 Q6 | ReeseÕs Peanut Butter Cups 1796 non-null object 82 Q6 | Reese's Pieces 1784 non-null object 83 Q6 | Reggie Jackson Bar 1460 non-null object 84 Q6 | Rolos 1761 non-null object 85 Q6 | Sandwich-sized bags filled with BooBerry Crunch 1699 non-null object 86 Q6 | Skittles 1769 non-null object 87 Q6 | Smarties (American) 1750 non-null object 88 Q6 | Smarties (Commonwealth) 1573 non-null object 89 Q6 | Snickers 1785 non-null object 90 Q6 | Sourpatch Kids (i.e. abominations of nature) 1737 non-null object 91 Q6 | Spotted Dick 1593 non-null object 92 Q6 | Starburst 1782 non-null object 93 Q6 | Sweet Tarts 1767 non-null object 94 Q6 | Swedish Fish 1760 non-null object 95 Q6 | Sweetums (a friend to diabetes) 1472 non-null object 96 Q6 | Take 5 1557 non-null object 97 Q6 | Tic Tacs 1761 non-null object 98 Q6 | Those odd marshmallow circus peanut things 1739 non-null object 99 Q6 | Three Musketeers 1767 non-null object 100 Q6 | Tolberone something or other 1769 non-null object 101 Q6 | Trail Mix 1767 non-null object 102 Q6 | Twix 1785 non-null object 103 Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein 1683 non-null object 104 Q6 | Vicodin 1686 non-null object 105 Q6 | Whatchamacallit Bars 1652 non-null object 106 Q6 | White Bread 1718 non-null object 107 Q6 | Whole Wheat anything 1728 non-null object 108 Q6 | York Peppermint Patties 1770 non-null object 109 Q7: JOY OTHER 917 non-null object 110 Q8: DESPAIR OTHER 723 non-null object 111 Q9: OTHER COMMENTS 389 non-null object 112 Q10: DRESS 1730 non-null object 113 Unnamed: 113 9 non-null object 114 Q11: DAY 1750 non-null object 115 Q12: MEDIA [Daily Dish] 85 non-null float64 116 Q12: MEDIA [Science] 1376 non-null float64 117 Q12: MEDIA [ESPN] 99 non-null float64 118 Q12: MEDIA [Yahoo] 67 non-null float64 119 Click Coordinates (x, y) 1619 non-null object dtypes: float64(4), int64(1), object(115) memory usage: 2.3+ MB
Of course, if you are just looking for the column names, you can just use a simple for
loop.
# print a list of column names
for col in candy.columns:
print(col)
Internal ID Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q5: STATE, PROVINCE, COUNTY, ETC Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) Q6 | Bottle Caps Q6 | Box'o'Raisins Q6 | Broken glow stick Q6 | Butterfinger Q6 | Cadbury Creme Eggs Q6 | Candy Corn Q6 | Candy that is clearly just the stuff given out for free at restaurants Q6 | Caramellos Q6 | Cash, or other forms of legal tender Q6 | Chardonnay Q6 | Chick-o-Sticks (we donÕt know what that is) Q6 | Chiclets Q6 | Coffee Crisp Q6 | Creepy Religious comics/Chick Tracts Q6 | Dental paraphenalia Q6 | Dots Q6 | Dove Bars Q6 | Fuzzy Peaches Q6 | Generic Brand Acetaminophen Q6 | Glow sticks Q6 | Goo Goo Clusters Q6 | Good N' Plenty Q6 | Gum from baseball cards Q6 | Gummy Bears straight up Q6 | Hard Candy Q6 | Healthy Fruit Q6 | Heath Bar Q6 | Hershey's Dark Chocolate Q6 | HersheyÕs Milk Chocolate Q6 | Hershey's Kisses Q6 | Hugs (actual physical hugs) Q6 | Jolly Rancher (bad flavor) Q6 | Jolly Ranchers (good flavor) Q6 | JoyJoy (Mit Iodine!) Q6 | Junior Mints Q6 | Senior Mints Q6 | Kale smoothie Q6 | Kinder Happy Hippo Q6 | Kit Kat Q6 | LaffyTaffy Q6 | LemonHeads Q6 | Licorice (not black) Q6 | Licorice (yes black) Q6 | Lindt Truffle Q6 | Lollipops Q6 | Mars Q6 | Maynards Q6 | Mike and Ike Q6 | Milk Duds Q6 | Milky Way Q6 | Regular M&Ms Q6 | Peanut M&MÕs Q6 | Blue M&M's Q6 | Red M&M's Q6 | Green Party M&M's Q6 | Independent M&M's Q6 | Abstained from M&M'ing. Q6 | Minibags of chips Q6 | Mint Kisses Q6 | Mint Juleps Q6 | Mr. Goodbar Q6 | Necco Wafers Q6 | Nerds Q6 | Nestle Crunch Q6 | Now'n'Laters Q6 | Peeps Q6 | Pencils Q6 | Pixy Stix Q6 | Real Housewives of Orange County Season 9 Blue-Ray Q6 | ReeseÕs Peanut Butter Cups Q6 | Reese's Pieces Q6 | Reggie Jackson Bar Q6 | Rolos Q6 | Sandwich-sized bags filled with BooBerry Crunch Q6 | Skittles Q6 | Smarties (American) Q6 | Smarties (Commonwealth) Q6 | Snickers Q6 | Sourpatch Kids (i.e. abominations of nature) Q6 | Spotted Dick Q6 | Starburst Q6 | Sweet Tarts Q6 | Swedish Fish Q6 | Sweetums (a friend to diabetes) Q6 | Take 5 Q6 | Tic Tacs Q6 | Those odd marshmallow circus peanut things Q6 | Three Musketeers Q6 | Tolberone something or other Q6 | Trail Mix Q6 | Twix Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein Q6 | Vicodin Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties Q7: JOY OTHER Q8: DESPAIR OTHER Q9: OTHER COMMENTS Q10: DRESS Unnamed: 113 Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo] Click Coordinates (x, y)
This data set is pretty messy. Your goal is now to perform the following actions to get it to the point where it can be passed to a machine learning model.
Note: Unless the instructions ask you to do something different, please always update the original candy
DataFrame for the exercises below. The automatic grading in CodeGrade will check your final DataFrame and ensure that you have performed all required data manipulations. Also, feel free to add additional cells as needed.
Exercise_A: Taking a look at the column names, you may notice that some include the character Õ
. This should instead be an apostrophe '
mark. Rename the column names that include the Õ
character and replace it was an apostrophe.
Remember that you should be updating the candy
DataFrame for the tasks listed as "Exercises" unless told differently.
# Identify the columns with special character Õ
#We find one column under Q6 | HersheyÕs Milk Chocolate
candy.filter(like="col").select_dtypes("object").columns
#This gives us the following, before we replace
#Index(['Q6 | Hershey's Dark Chocolate', 'Q6 | HersheyÕs Milk Chocolate'], dtype='object')
Index(['Q6 | Hershey's Dark Chocolate', 'Q6 | HersheyÕs Milk Chocolate'], dtype='object')
# Now we rename Q6 | HersheyÕs Milk Chocolate column as Q6_new | 'Hershey's Milk Chocolate'
# And in a way we replace now the special character Õ to an apostrophe
candy = candy.rename(columns={"Q6 | HersheyÕs Milk Chocolate": "Q6_new | Hershey's Milk Chocolate"}) ### ENTER CODE HERE ###
candy
Internal ID | Q1: GOING OUT? | Q2: GENDER | Q3: AGE | Q4: COUNTRY | Q5: STATE, PROVINCE, COUNTY, ETC | Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | ... | Q8: DESPAIR OTHER | Q9: OTHER COMMENTS | Q10: DRESS | Unnamed: 113 | Q11: DAY | Q12: MEDIA [Daily Dish] | Q12: MEDIA [Science] | Q12: MEDIA [ESPN] | Q12: MEDIA [Yahoo] | Click Coordinates (x, y) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 90258773 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 90272821 | No | Male | 44 | USA | NM | MEH | DESPAIR | JOY | MEH | ... | NaN | Bottom line is Twix is really the only candy w... | White and gold | NaN | Sunday | NaN | 1.0 | NaN | NaN | (84, 25) |
2 | 90272829 | NaN | Male | 49 | USA | Virginia | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 90272840 | No | Male | 40 | us | or | MEH | DESPAIR | JOY | MEH | ... | NaN | Raisins can go to hell | White and gold | NaN | Sunday | NaN | 1.0 | NaN | NaN | (75, 23) |
4 | 90272841 | No | Male | 23 | usa | exton pa | JOY | DESPAIR | JOY | DESPAIR | ... | NaN | NaN | White and gold | NaN | Friday | NaN | 1.0 | NaN | NaN | (70, 10) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2474 | 90314359 | No | Male | 24 | USA | MD | JOY | DESPAIR | MEH | DESPAIR | ... | Fruit Stripe Gum | NaN | White and gold | NaN | Friday | NaN | NaN | NaN | NaN | NaN |
2475 | 90314580 | No | Female | 33 | USA | New York | MEH | DESPAIR | JOY | NaN | ... | Capers | NaN | Blue and black | NaN | Friday | NaN | 1.0 | NaN | NaN | (70, 26) |
2476 | 90314634 | No | Female | 26 | USA | Tennessee | MEH | DESPAIR | JOY | DESPAIR | ... | NaN | NaN | Blue and black | NaN | Friday | NaN | 1.0 | NaN | NaN | (67, 35) |
2477 | 90314658 | No | Male | 58 | Usa | North Carolina | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2478 | 90314802 | No | Female | 66 | usa | Pennsylvania | DESPAIR | DESPAIR | JOY | DESPAIR | ... | NaN | You hit all my chocolate highlights, and broug... | White and gold | NaN | Sunday | 1.0 | NaN | NaN | NaN | (19, 26) |
2479 rows × 120 columns
#Check 1: We get the following change, after renaming and replacing
candy.filter(like="col").select_dtypes("object").columns
Index(['Q6 | Hershey's Dark Chocolate', 'Q6_new | Hershey's Milk Chocolate'], dtype='object')
#Check#2 Looking into this dataset
candy["Q6 | Hershey's Dark Chocolate"]
0 NaN 1 JOY 2 NaN 3 MEH 4 JOY ... 2474 JOY 2475 MEH 2476 JOY 2477 NaN 2478 JOY Name: Q6 | Hershey's Dark Chocolate, Length: 2479, dtype: object
Q1: How many duplicated rows are there in the file? Assume that a duplicate is any row that is exactly the same as another one. Save this number as Q1
.
#Number of duplicated rows
Q1 = candy.duplicated().sum()
Q1
17
Q2: How many duplicated rows are there in the file if we were to assume that a duplicate is any row with the same Internal ID
number as another. In other words, even if the other values are different, a row would count as a duplicate if it had the same Internal ID
as another. Save this number as Q2
.
#Number of duplicated rows with same Internal ID number
Q2 = candy[candy.duplicated('Internal ID')]
Q2 = len(Q2)
Q2
19
Exercise_B: Drop any duplicates from the candy
DataFrame. Duplicates are to be defined as any row with the same Internal ID
as another. Use the default setting that keeps the first record from the duplicates.
##Drop any duplicates from the candy DataFrame
#subset is used to remove duplicates from specific column
#To remove duplicates from entire dataset, we use: df.drop_duplicates()
# keep option is set to ‘first’ to remove duplicates and keep the first occurrences only
# keep option is set to False to remove all the occurrences of duplicate column(s)
candy = candy.drop_duplicates(subset = 'Internal ID', keep='first')
candy
Internal ID | Q1: GOING OUT? | Q2: GENDER | Q3: AGE | Q4: COUNTRY | Q5: STATE, PROVINCE, COUNTY, ETC | Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | ... | Q8: DESPAIR OTHER | Q9: OTHER COMMENTS | Q10: DRESS | Unnamed: 113 | Q11: DAY | Q12: MEDIA [Daily Dish] | Q12: MEDIA [Science] | Q12: MEDIA [ESPN] | Q12: MEDIA [Yahoo] | Click Coordinates (x, y) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 90258773 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 90272821 | No | Male | 44 | USA | NM | MEH | DESPAIR | JOY | MEH | ... | NaN | Bottom line is Twix is really the only candy w... | White and gold | NaN | Sunday | NaN | 1.0 | NaN | NaN | (84, 25) |
2 | 90272829 | NaN | Male | 49 | USA | Virginia | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 90272840 | No | Male | 40 | us | or | MEH | DESPAIR | JOY | MEH | ... | NaN | Raisins can go to hell | White and gold | NaN | Sunday | NaN | 1.0 | NaN | NaN | (75, 23) |
4 | 90272841 | No | Male | 23 | usa | exton pa | JOY | DESPAIR | JOY | DESPAIR | ... | NaN | NaN | White and gold | NaN | Friday | NaN | 1.0 | NaN | NaN | (70, 10) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2474 | 90314359 | No | Male | 24 | USA | MD | JOY | DESPAIR | MEH | DESPAIR | ... | Fruit Stripe Gum | NaN | White and gold | NaN | Friday | NaN | NaN | NaN | NaN | NaN |
2475 | 90314580 | No | Female | 33 | USA | New York | MEH | DESPAIR | JOY | NaN | ... | Capers | NaN | Blue and black | NaN | Friday | NaN | 1.0 | NaN | NaN | (70, 26) |
2476 | 90314634 | No | Female | 26 | USA | Tennessee | MEH | DESPAIR | JOY | DESPAIR | ... | NaN | NaN | Blue and black | NaN | Friday | NaN | 1.0 | NaN | NaN | (67, 35) |
2477 | 90314658 | No | Male | 58 | Usa | North Carolina | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2478 | 90314802 | No | Female | 66 | usa | Pennsylvania | DESPAIR | DESPAIR | JOY | DESPAIR | ... | NaN | You hit all my chocolate highlights, and broug... | White and gold | NaN | Sunday | 1.0 | NaN | NaN | NaN | (19, 26) |
2460 rows × 120 columns
Exercise_C: Your next task is to remove the following columns from the candy
DataFrame as we will not use these columns for this project. You are welcome to do further analysis on these columns but do not save your analysis in this notebook.
Remove the following columns: Internal ID
, Q5: STATE, PROVINCE, COUNTY, ETC
, Q7: JOY OTHER
, Q8: DESPAIR OTHER
, Q9: OTHER COMMENTS
, Unnamed: 113
, Click Coordinates (x, y)
.
#Remove the following multiple columns from the candy DataFrame:
candy = candy.drop(
columns=["Internal ID","Q5: STATE, PROVINCE, COUNTY, ETC",
"Q7: JOY OTHER", "Q8: DESPAIR OTHER", "Q9: OTHER COMMENTS",
"Unnamed: 113", "Click Coordinates (x, y)"], axis = 1)
candy
Q1: GOING OUT? | Q2: GENDER | Q3: AGE | Q4: COUNTRY | Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | Q6 | Bonkers (the candy) | Q6 | Bonkers (the board game) | ... | Q6 | Whatchamacallit Bars | Q6 | White Bread | Q6 | Whole Wheat anything | Q6 | York Peppermint Patties | Q10: DRESS | Q11: DAY | Q12: MEDIA [Daily Dish] | Q12: MEDIA [Science] | Q12: MEDIA [ESPN] | Q12: MEDIA [Yahoo] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | No | Male | 44 | USA | MEH | DESPAIR | JOY | MEH | DESPAIR | DESPAIR | ... | DESPAIR | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday | NaN | 1.0 | NaN | NaN |
2 | NaN | Male | 49 | USA | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | No | Male | 40 | us | MEH | DESPAIR | JOY | MEH | MEH | DESPAIR | ... | JOY | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday | NaN | 1.0 | NaN | NaN |
4 | No | Male | 23 | usa | JOY | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | ... | JOY | DESPAIR | DESPAIR | JOY | White and gold | Friday | NaN | 1.0 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2474 | No | Male | 24 | USA | JOY | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | DESPAIR | MEH | DESPAIR | MEH | White and gold | Friday | NaN | NaN | NaN | NaN |
2475 | No | Female | 33 | USA | MEH | DESPAIR | JOY | NaN | NaN | NaN | ... | JOY | DESPAIR | MEH | JOY | Blue and black | Friday | NaN | 1.0 | NaN | NaN |
2476 | No | Female | 26 | USA | MEH | DESPAIR | JOY | DESPAIR | MEH | JOY | ... | MEH | DESPAIR | DESPAIR | MEH | Blue and black | Friday | NaN | 1.0 | NaN | NaN |
2477 | No | Male | 58 | Usa | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2478 | No | Female | 66 | usa | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | MEH | DESPAIR | JOY | White and gold | Sunday | 1.0 | NaN | NaN | NaN |
2460 rows × 113 columns
Code Check: As a check for the above exercises, the shape of your data should now be: (2460, 113)
#Code check
candy.shape
(2460, 113)
Exercise_D: Let's now take a look at the Q2: GENDER
column since this will be what we are trying to predict. Take a look at the value counts for this column.
# view value counts for Q2: GENDER
candy['Q2: GENDER'].value_counts()
Male 1466 Female 839 I'd rather not say 83 Other 30 Name: Q2: GENDER, dtype: int64
Q3: How many missing values are in the Q2: GENDER
column? Save this as Q3
.
# We use this to create True/False mask
candy['Q2: GENDER'].isnull()
0 True 1 False 2 False 3 False 4 False ... 2474 False 2475 False 2476 False 2477 False 2478 False Name: Q2: GENDER, Length: 2460, dtype: bool
# Number of missing values from data
Q3 = candy['Q2: GENDER'].isnull().sum()
Q3
42
Exercise_E: Using the candy
DataFrame, remove all rows with a missing value in the Q2: GENDER
column. (This should overwrite and be saved as candy
like you have been doing for the previous exercises.)
# Remove all rows with a missing value in the Q2: GENDER column
candy = candy.dropna(subset=['Q2: GENDER'])
#Candy shape #from 2460 to 2418 = 2460-42
candy.shape
(2418, 113)
#Code check
candy['Q2: GENDER'].isnull().sum()
0
Exercise_F: For this project, we want to use binary classification, which predicts one of two classes. We want to predict between Male
or Female
. Because of this, select only the rows that contain either Male
or Female
in the Q2: GENDER
column.
# Select only the rows that contain either Male or Female in the Q2: GENDER column
candy = candy[candy['Q2: GENDER'].isin(['Male', 'Female'])]
candy
Q1: GOING OUT? | Q2: GENDER | Q3: AGE | Q4: COUNTRY | Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | Q6 | Bonkers (the candy) | Q6 | Bonkers (the board game) | ... | Q6 | Whatchamacallit Bars | Q6 | White Bread | Q6 | Whole Wheat anything | Q6 | York Peppermint Patties | Q10: DRESS | Q11: DAY | Q12: MEDIA [Daily Dish] | Q12: MEDIA [Science] | Q12: MEDIA [ESPN] | Q12: MEDIA [Yahoo] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | No | Male | 44 | USA | MEH | DESPAIR | JOY | MEH | DESPAIR | DESPAIR | ... | DESPAIR | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday | NaN | 1.0 | NaN | NaN |
2 | NaN | Male | 49 | USA | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | No | Male | 40 | us | MEH | DESPAIR | JOY | MEH | MEH | DESPAIR | ... | JOY | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday | NaN | 1.0 | NaN | NaN |
4 | No | Male | 23 | usa | JOY | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | ... | JOY | DESPAIR | DESPAIR | JOY | White and gold | Friday | NaN | 1.0 | NaN | NaN |
5 | No | Male | NaN | NaN | JOY | DESPAIR | JOY | NaN | NaN | NaN | ... | JOY | DESPAIR | DESPAIR | JOY | NaN | NaN | NaN | 1.0 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2474 | No | Male | 24 | USA | JOY | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | DESPAIR | MEH | DESPAIR | MEH | White and gold | Friday | NaN | NaN | NaN | NaN |
2475 | No | Female | 33 | USA | MEH | DESPAIR | JOY | NaN | NaN | NaN | ... | JOY | DESPAIR | MEH | JOY | Blue and black | Friday | NaN | 1.0 | NaN | NaN |
2476 | No | Female | 26 | USA | MEH | DESPAIR | JOY | DESPAIR | MEH | JOY | ... | MEH | DESPAIR | DESPAIR | MEH | Blue and black | Friday | NaN | 1.0 | NaN | NaN |
2477 | No | Male | 58 | Usa | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2478 | No | Female | 66 | usa | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | MEH | DESPAIR | JOY | White and gold | Sunday | 1.0 | NaN | NaN | NaN |
2305 rows × 113 columns
Code Check: As a check for the above exercises, the shape of your data should now be: (2305, 113)
#Code check
candy.shape
(2305, 113)
Now, let's work on filling some of the missing data. There are easier ways to do this with the sklearn library which you will learn about more in the machine learning classes, but for now, let's try to practice our Pandas skills.
Q4: How many missing values are in the Q1: GOING OUT?
column? Save this number as Q4
.
Q4 = candy['Q1: GOING OUT?'].isnull().sum() #77
Q4
77
Exercise_G: For a future analysis question, we are interested in those that we know will definitely go out for Halloween. Because of this, fill all missing values in the Q1: GOING OUT?
column with a No
value.
#Fill all missing values in the Q1: GOING OUT? column with a 'No' value.
candy['Q1: GOING OUT?'].fillna('No', inplace=True)
Code Check: Double check your above work and look at the value counts for the Q1: GOING OUT?
column. Make sure that you only have "Yes" and No" values and that they add up to 2305, which is the number of rows you should have at this step in the assignment.
#Code check #total No + Yes = 2305
candy['Q1: GOING OUT?'].value_counts()
No 2007 Yes 298 Name: Q1: GOING OUT?, dtype: int64
#code check for missing values
candy['Q1: GOING OUT?'].isnull().sum()
0
Q5: To get ready for the next step, let's practice selecting all the columns: going from Q6 | 100 Grand Bar
to Q11: DAY
. Save this slice as Q5
.
#Select all columns going from: Q6 | 100 Grand Bar to Q11: DAY
# Save this slice as Q5
Q5 = candy.loc[:,'Q6 | 100 Grand Bar':'Q11: DAY']
Q5
Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | Q6 | Bonkers (the candy) | Q6 | Bonkers (the board game) | Q6 | Bottle Caps | Q6 | Box'o'Raisins | Q6 | Broken glow stick | Q6 | Butterfinger | ... | Q6 | Trail Mix | Q6 | Twix | Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein | Q6 | Vicodin | Q6 | Whatchamacallit Bars | Q6 | White Bread | Q6 | Whole Wheat anything | Q6 | York Peppermint Patties | Q10: DRESS | Q11: DAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | MEH | DESPAIR | JOY | MEH | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | MEH | DESPAIR | JOY | MEH | MEH | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | MEH | JOY | DESPAIR | JOY | JOY | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday |
4 | JOY | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | DESPAIR | JOY | MEH | JOY | JOY | DESPAIR | DESPAIR | JOY | White and gold | Friday |
5 | JOY | DESPAIR | JOY | NaN | NaN | NaN | MEH | MEH | DESPAIR | JOY | ... | MEH | JOY | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | JOY | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2474 | JOY | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | MEH | DESPAIR | DESPAIR | MEH | ... | JOY | JOY | MEH | JOY | DESPAIR | MEH | DESPAIR | MEH | White and gold | Friday |
2475 | MEH | DESPAIR | JOY | NaN | NaN | NaN | NaN | DESPAIR | DESPAIR | JOY | ... | DESPAIR | JOY | NaN | NaN | JOY | DESPAIR | MEH | JOY | Blue and black | Friday |
2476 | MEH | DESPAIR | JOY | DESPAIR | MEH | JOY | DESPAIR | MEH | MEH | DESPAIR | ... | MEH | MEH | MEH | JOY | MEH | DESPAIR | DESPAIR | MEH | Blue and black | Friday |
2477 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2478 | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | DESPAIR | MEH | DESPAIR | JOY | ... | DESPAIR | DESPAIR | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | JOY | White and gold | Sunday |
2305 rows × 105 columns
Exercise_H: Now that you know how to slice the data, fill any missing values in the candy
DataFrame for those columns (going from Q6 | 100 Grand Bar
to Q11: DAY
) with the string NO_ANSWER
.
#Fill any missing values in the candy DataFrame for those columns
# (going from Q6 | 100 Grand Bar to Q11: DAY) with the string NO_ANSWER
#I call this new df as Q51, to latter help us join with anther new df
Q51 = candy.loc[:,'Q6 | 100 Grand Bar':'Q11: DAY'].fillna('NO_ANSWER')
Q51
Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | Q6 | Bonkers (the candy) | Q6 | Bonkers (the board game) | Q6 | Bottle Caps | Q6 | Box'o'Raisins | Q6 | Broken glow stick | Q6 | Butterfinger | ... | Q6 | Trail Mix | Q6 | Twix | Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein | Q6 | Vicodin | Q6 | Whatchamacallit Bars | Q6 | White Bread | Q6 | Whole Wheat anything | Q6 | York Peppermint Patties | Q10: DRESS | Q11: DAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | MEH | DESPAIR | JOY | MEH | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday |
2 | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER |
3 | MEH | DESPAIR | JOY | MEH | MEH | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | MEH | JOY | DESPAIR | JOY | JOY | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday |
4 | JOY | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | DESPAIR | JOY | MEH | JOY | JOY | DESPAIR | DESPAIR | JOY | White and gold | Friday |
5 | JOY | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | MEH | MEH | DESPAIR | JOY | ... | MEH | JOY | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | JOY | NO_ANSWER | NO_ANSWER |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2474 | JOY | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | MEH | DESPAIR | DESPAIR | MEH | ... | JOY | JOY | MEH | JOY | DESPAIR | MEH | DESPAIR | MEH | White and gold | Friday |
2475 | MEH | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | DESPAIR | DESPAIR | JOY | ... | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | JOY | DESPAIR | MEH | JOY | Blue and black | Friday |
2476 | MEH | DESPAIR | JOY | DESPAIR | MEH | JOY | DESPAIR | MEH | MEH | DESPAIR | ... | MEH | MEH | MEH | JOY | MEH | DESPAIR | DESPAIR | MEH | Blue and black | Friday |
2477 | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER |
2478 | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | DESPAIR | MEH | DESPAIR | JOY | ... | DESPAIR | DESPAIR | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | JOY | White and gold | Sunday |
2305 rows × 105 columns
Exercise_I: For all four Q12: Media
columns in the candy
DataFrame, fill the missing values with 0.0
.
#For all four Q12: Media columns in the candy DataFrame
#Fill the missing values with 0.0
#Let's name this new df as Q52
candy.update(candy[['Q12: MEDIA [Daily Dish]',
'Q12: MEDIA [Science]',
'Q12: MEDIA [ESPN]',
'Q12: MEDIA [Yahoo]']].fillna(0))
#1/ Code check for Q12 columns
#After First checking, I name this new df as Q52
Q52 = candy[['Q12: MEDIA [Daily Dish]',
'Q12: MEDIA [Science]',
'Q12: MEDIA [ESPN]',
'Q12: MEDIA [Yahoo]']]
Q52
Q12: MEDIA [Daily Dish] | Q12: MEDIA [Science] | Q12: MEDIA [ESPN] | Q12: MEDIA [Yahoo] | |
---|---|---|---|---|
1 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 1.0 | 0.0 | 0.0 |
4 | 0.0 | 1.0 | 0.0 | 0.0 |
5 | 0.0 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... |
2474 | 0.0 | 0.0 | 0.0 | 0.0 |
2475 | 0.0 | 1.0 | 0.0 | 0.0 |
2476 | 0.0 | 1.0 | 0.0 | 0.0 |
2477 | 0.0 | 0.0 | 0.0 | 0.0 |
2478 | 1.0 | 0.0 | 0.0 | 0.0 |
2305 rows × 4 columns
#Now we combine Q51 and Q52 together,and check all missed values filled/replaced
#We use the concat method to join both DFs side by side(hstack)
# We save this new df in candy1 t
candy1 = pd.concat([Q51, Q52], axis=1)
candy1
Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | Q6 | Bonkers (the candy) | Q6 | Bonkers (the board game) | Q6 | Bottle Caps | Q6 | Box'o'Raisins | Q6 | Broken glow stick | Q6 | Butterfinger | ... | Q6 | Whatchamacallit Bars | Q6 | White Bread | Q6 | Whole Wheat anything | Q6 | York Peppermint Patties | Q10: DRESS | Q11: DAY | Q12: MEDIA [Daily Dish] | Q12: MEDIA [Science] | Q12: MEDIA [ESPN] | Q12: MEDIA [Yahoo] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | MEH | DESPAIR | JOY | MEH | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday | 0.0 | 1.0 | 0.0 | 0.0 |
2 | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | 0.0 | 0.0 | 0.0 | 0.0 |
3 | MEH | DESPAIR | JOY | MEH | MEH | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | JOY | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday | 0.0 | 1.0 | 0.0 | 0.0 |
4 | JOY | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | JOY | DESPAIR | DESPAIR | JOY | White and gold | Friday | 0.0 | 1.0 | 0.0 | 0.0 |
5 | JOY | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | MEH | MEH | DESPAIR | JOY | ... | JOY | DESPAIR | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | 0.0 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2474 | JOY | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | MEH | DESPAIR | DESPAIR | MEH | ... | DESPAIR | MEH | DESPAIR | MEH | White and gold | Friday | 0.0 | 0.0 | 0.0 | 0.0 |
2475 | MEH | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | DESPAIR | DESPAIR | JOY | ... | JOY | DESPAIR | MEH | JOY | Blue and black | Friday | 0.0 | 1.0 | 0.0 | 0.0 |
2476 | MEH | DESPAIR | JOY | DESPAIR | MEH | JOY | DESPAIR | MEH | MEH | DESPAIR | ... | MEH | DESPAIR | DESPAIR | MEH | Blue and black | Friday | 0.0 | 1.0 | 0.0 | 0.0 |
2477 | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | 0.0 | 0.0 | 0.0 | 0.0 |
2478 | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | DESPAIR | MEH | DESPAIR | JOY | ... | DESPAIR | MEH | DESPAIR | JOY | White and gold | Sunday | 1.0 | 0.0 | 0.0 | 0.0 |
2305 rows × 109 columns
Code Check: As a check for the above code, make sure that there are no missing values left for the Q6
to Q12
columns.
# Code check for missing values: 'Q6 | 100 Grand Bar':'Q12: Media'
candy1.isnull().sum()
Q6 | 100 Grand Bar 0 Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) 0 Q6 | Any full-sized candy bar 0 Q6 | Black Jacks 0 Q6 | Bonkers (the candy) 0 .. Q11: DAY 0 Q12: MEDIA [Daily Dish] 0 Q12: MEDIA [Science] 0 Q12: MEDIA [ESPN] 0 Q12: MEDIA [Yahoo] 0 Length: 109, dtype: int64
candy2
for remaining columns and then next we combine candy1
and cand 2
DataFrames to get new modified candy
dataset¶#Let's assign the rest column a new df: cand2
candy2 = candy.loc[:,'Q1: GOING OUT?':'Q4: COUNTRY']
candy2
Q1: GOING OUT? | Q2: GENDER | Q3: AGE | Q4: COUNTRY | |
---|---|---|---|---|
1 | No | Male | 44 | USA |
2 | No | Male | 49 | USA |
3 | No | Male | 40 | us |
4 | No | Male | 23 | usa |
5 | No | Male | NaN | NaN |
... | ... | ... | ... | ... |
2474 | No | Male | 24 | USA |
2475 | No | Female | 33 | USA |
2476 | No | Female | 26 | USA |
2477 | No | Male | 58 | Usa |
2478 | No | Female | 66 | usa |
2305 rows × 4 columns
# Now let's combine candy1, and candy2 to get modified new candy dataset
# We save it in candy DataFrame
candy = pd.concat([candy2, candy1], axis=1)
candy
Q1: GOING OUT? | Q2: GENDER | Q3: AGE | Q4: COUNTRY | Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | Q6 | Bonkers (the candy) | Q6 | Bonkers (the board game) | ... | Q6 | Whatchamacallit Bars | Q6 | White Bread | Q6 | Whole Wheat anything | Q6 | York Peppermint Patties | Q10: DRESS | Q11: DAY | Q12: MEDIA [Daily Dish] | Q12: MEDIA [Science] | Q12: MEDIA [ESPN] | Q12: MEDIA [Yahoo] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | No | Male | 44 | USA | MEH | DESPAIR | JOY | MEH | DESPAIR | DESPAIR | ... | DESPAIR | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday | 0.0 | 1.0 | 0.0 | 0.0 |
2 | No | Male | 49 | USA | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | 0.0 | 0.0 | 0.0 | 0.0 |
3 | No | Male | 40 | us | MEH | DESPAIR | JOY | MEH | MEH | DESPAIR | ... | JOY | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday | 0.0 | 1.0 | 0.0 | 0.0 |
4 | No | Male | 23 | usa | JOY | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | ... | JOY | DESPAIR | DESPAIR | JOY | White and gold | Friday | 0.0 | 1.0 | 0.0 | 0.0 |
5 | No | Male | NaN | NaN | JOY | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | JOY | DESPAIR | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | 0.0 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2474 | No | Male | 24 | USA | JOY | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | DESPAIR | MEH | DESPAIR | MEH | White and gold | Friday | 0.0 | 0.0 | 0.0 | 0.0 |
2475 | No | Female | 33 | USA | MEH | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | JOY | DESPAIR | MEH | JOY | Blue and black | Friday | 0.0 | 1.0 | 0.0 | 0.0 |
2476 | No | Female | 26 | USA | MEH | DESPAIR | JOY | DESPAIR | MEH | JOY | ... | MEH | DESPAIR | DESPAIR | MEH | Blue and black | Friday | 0.0 | 1.0 | 0.0 | 0.0 |
2477 | No | Male | 58 | Usa | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | 0.0 | 0.0 | 0.0 | 0.0 |
2478 | No | Female | 66 | usa | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | MEH | DESPAIR | JOY | White and gold | Sunday | 1.0 | 0.0 | 0.0 | 0.0 |
2305 rows × 113 columns
Q4: COUNTRY
column and see what we can do about it. First, run the code below to look at the different unique values in the data.¶# check unique values, in Q4: COUNTRY column
candy['Q4: COUNTRY'].unique()
array(['USA ', 'USA', 'us', 'usa', nan, 'canada', 'Canada', 'Us', 'US', 'Murica', 'United States', 'uk', 'United Kingdom', 'united states', 'Usa', 'United States ', 'United staes', 'United States of America', 'UAE', 'England', 'UK', 'canada ', 'United states', 'u.s.a.', '35', 'france', 'United States of America ', 'america', 'U.S.A.', 'finland', 'unhinged states', 'Mexico', 'Canada ', 'united states of america', 'US of A', 'The United States', 'North Carolina ', 'Unied States', 'Netherlands', 'germany', 'Europe', 'U S', 'u.s.', 'U.K. ', 'Costa Rica', 'The United States of America', 'unite states', 'U.S.', '46', 'Australia', 'Greece', 'USA? Hard to tell anymore..', "'merica", '45', 'United State', '32', 'France', 'australia', 'Can', 'Canae', 'Trumpistan', 'Ireland', 'United Sates', 'Korea', 'California', 'Unites States', 'Japan', 'USa', 'South africa', 'I pretend to be from Canada, but I am really from the United States.', 'Usa ', 'Uk', 'Germany', 'Canada`', 'Scotland', 'UK ', 'Denmark', 'United Stated', 'France ', 'Switzerland', 'UD', 'Scotland ', 'South Korea', 'New Jersey', 'CANADA', 'Indonesia', 'United ststes', 'America', 'The Netherlands', 'United Statss', 'endland', 'murrika', 'USA! USA! USA!', 'USAA', 'united States ', 'soviet canuckistan', 'N. America', 'Singapore', 'USSA', 'China', 'Taiwan', 'Ireland ', 'hong kong', 'spain', 'Sweden', 'Hong Kong', 'U.S. ', 'Narnia', 'u s a', 'United Statea', 'united ststes', 'subscribe to dm4uz3 on youtube', 'United kingdom', 'USA USA USA!!!!', "I don't know anymore", 'Fear and Loathing'], dtype=object)
Code Check: As a check for the Country column, check to see how many unique values are in the data. You should have 115
different unique values for the Q4: COUNTRY
column. If you have less or more than this number, double check your work above.
# check the Q4: COUNTRY number of unique values
candy['Q4: COUNTRY'].nunique()
115
We want to clean up this data to only include four areas: USA, Canada, Europe (the continent, not necessarily the European Union), and Other.
There are different ways to do this, but I would suggest that you look at the way we handled the property_type
column in the vienna
data set and the code in the amenities_to_columns()
function in the module notebook. These might be a little harder than those examples but they should give you a good baseline approach.
You could use replace()
for this step, and it is fine if you ultimately decide to do this, but I would suggest that you come up with a solution similar to what was shown in the vienna
data cleaning notebook. This method would be much more robust if you had many more values in your data.
I suggest the following order for this section to make it easier:
Other
Other
(doing this step will help when trying to use us
in the next step if you use string methods)USA
CA
EU
Other
Exercise_J: Fill the missing values in the Q4: Country
column with Other
.
#View the Q4: Country column
candy['Q4: COUNTRY']
1 USA 2 USA 3 us 4 usa 5 NaN ... 2474 USA 2475 USA 2476 USA 2477 Usa 2478 usa Name: Q4: COUNTRY, Length: 2305, dtype: object
#Check for value counts
candy['Q4: COUNTRY'].value_counts()
USA 669 United States 475 usa 210 Canada 168 Usa 131 ... United State 1 45 1 'merica 1 USA? Hard to tell anymore.. 1 Fear and Loathing 1 Name: Q4: COUNTRY, Length: 115, dtype: int64
#Fill the missing values in the Q4: Country column with 'Other'
candy['Q4: COUNTRY'] = candy['Q4: COUNTRY'].fillna('Other')
candy['Q4: COUNTRY']
1 USA 2 USA 3 us 4 usa 5 Other ... 2474 USA 2475 USA 2476 USA 2477 Usa 2478 usa Name: Q4: COUNTRY, Length: 2305, dtype: object
Code Check: Double check that there are no missing values in the Q4: COUNTRY
column. Also, double check the unique values to make sure that "Other" was added. This should mean that you now have 116
unique values for this column.
# check missing Q4 values
candy['Q4: COUNTRY'].isnull().sum()
0
# check unique values
candy['Q4: COUNTRY'].nunique()
116
Exercise_K: Combine all Australia entries into Other
. Watch out for capitalization issues. You should have 114
unique values after this step.
#Combine all Australia entries into Other
candy['Q4: COUNTRY'].replace({'Australia':'Other',
'australia':'Other'
}, inplace=True)
# check unique values
candy['Q4: COUNTRY'].nunique()
114
candy['Q4: COUNTRY'] == 'O'
1 False 2 False 3 False 4 False 5 False ... 2474 False 2475 False 2476 False 2477 False 2478 False Name: Q4: COUNTRY, Length: 2305, dtype: bool
Exercise_L: Combine all United States entries together into USA
. These would include the following:
'USA ', 'USA', 'us', 'usa', 'Us', 'US', 'Murica', 'United States', 'united states', 'Usa', 'United States ', 'United staes', 'United States of America', 'United states', 'u.s.a.', 'United States of America ', 'america', 'U.S.A.', 'unhinged states', 'united states of america', 'US of A', 'The United States', 'North Carolina ', 'Unied States', 'U S', 'u.s.', 'The United States of America', 'unite states','U.S.', 'USA? Hard to tell anymore..', "'merica", 'United State', 'United Sates', 'California', 'Unites States', 'USa', 'I pretend to be from Canada, but I am really from the United States.', 'Usa ', 'United Stated', 'New Jersey', 'United ststes', 'America', 'United Statss', 'murrika', 'USA! USA! USA!', 'USAA', 'united States ', 'N. America', 'USSA', 'U.S. ', 'u s a', 'United Statea', 'united ststes', 'USA USA USA!!!!'
#Combine all United States entries together into USA
candy['Q4: COUNTRY'].replace({
'USA ':'USA', 'USA':'USA', 'us':'USA', 'usa':'USA', 'Us':'USA', 'US':'USA',
'Murica':'USA', 'United States':'USA', 'united states':'USA',
'Usa':'USA', 'United States ':'USA', 'United staes':'USA',
'United States of America':'USA', 'United states':'USA',
'u.s.a.':'USA', 'United States of America ':'USA',
'america':'USA', 'U.S.A.':'USA', 'unhinged states':'USA',
'united states of america':'USA', 'US of A':'USA',
'The United States':'USA', 'North Carolina ':'USA',
'Unied States':'USA', 'U S':'USA', 'u.s.':'USA',
'The United States of America':'USA',
'unite states':'USA','U.S.':'USA', 'USA? Hard to tell anymore..':'USA',
"'merica":'USA', 'United State':'USA', 'United Sates':'USA',
'California':'USA', 'Unites States':'USA', 'USa':'USA',
'I pretend to be from Canada, but I am really from the United States.':'USA',
'Usa ':'USA', 'United Stated':'USA', 'New Jersey':'USA', 'United ststes':'USA',
'America':'USA', 'United Statss':'USA', 'murrika':'USA',
'USA! USA! USA!':'USA', 'USAA':'USA', 'united States ':'USA',
'N. America':'USA', 'USSA':'USA', 'U.S. ':'USA', 'u s a':'USA',
'United Statea':'USA', 'united ststes':'USA',
'USA USA USA!!!!':'USA'
}, inplace=True)
Code Check: You should be merging the above values together into 1 (USA
) and be left with 61 unique values after this step (including the USA
value).
# check unique values
candy['Q4: COUNTRY'].nunique()
61
Exercise_M: Combine the Canadian entries (both upper and lower case) and label them as CA
. Be careful as there are extra spaces, characters, and misspellings (Can, Canae).
These values include:
'canada', 'Canada', 'canada ', 'Canada ', 'Can', 'Canae', 'Canada`', 'CANADA'
#Combine the Canadian entries (both upper and lower case)
# And label them as CA
candy['Q4: COUNTRY'].replace({
'canada':'CA', 'Canada':'CA', 'canada ':'CA',
'Canada ':'CA', 'Can':'CA', 'Canae':'CA',
'Canada`':'CA', 'CANADA':'CA'}, inplace=True)
Code Check: You should be merging 8 values together into 1 (CA
) and be left with 54 unique values after this step (including the CA
value).
# check unique values
candy['Q4: COUNTRY'].nunique()
54
Exercise_N: Combine the European entries and label them as EU
. Again, we are looking at the continent of Europe and not necessarily the countries that are a part of the European Union.
These values include:
'uk', 'United Kingdom', 'England', 'UK', 'france', 'finland', 'Netherlands', 'germany', 'Europe', 'U.K. ', 'Greece', 'France', 'Ireland', 'Uk', 'Germany', 'Scotland', 'UK ', 'Denmark', 'France ', 'Switzerland', 'Scotland ', 'The Netherlands', 'Ireland ', 'spain', 'Sweden', 'United kingdom'
candy['Q4: COUNTRY'].replace({
'uk':'EU','United Kingdom':'EU',
'England':'EU', 'UK':'EU', 'france':'EU',
'finland':'EU', 'Netherlands':'EU',
'germany':'EU', 'Europe':'EU', 'U.K. ':'EU',
'Greece':'EU', 'France':'EU', 'Ireland':'EU',
'Uk':'EU', 'Germany':'EU', 'Scotland':'EU',
'UK ':'EU', 'Denmark':'EU', 'France ':'EU',
'Switzerland':'EU', 'Scotland ':'EU',
'The Netherlands':'EU', 'Ireland ':'EU',
'spain':'EU', 'Sweden':'EU',
'United kingdom':'EU'}, inplace=True)
Code Check: You should be merging 26 entries together and be left with 29 unique values after this step (including the EU
value).
# check unique values
candy['Q4: COUNTRY'].nunique()
29
Exercise_O: Finally, combine the other entries and label them as Other
.
#Combine the other entries and label them as 'Other'
candy['Q4: COUNTRY'].replace({
'Murica':'Other','nan':'Other','UAE':'Other','35':'Other',
'unhinged states':'Other','Mexico':'Other','Costa Rica':'Other','46':'Other',
'Other':'Other','45':'Other','32':'Other','Trumpistan':'Other','Korea':'Other',
'Japan':'Other','South africa':'Other','UD':'Other', 'South Korea':'Other',
'Indonesia':'Other','endland':'Other', 'murrika':'Other', 'soviet canuckistan':'Other',
'Singapore':'Other', 'China':'Other','Taiwan':'Other', 'hong kong':'Other', 'spain':'Other',
'Sweden':'Other','Hong Kong':'Other','Narnia':'Other', 'subscribe to dm4uz3 on youtube':'Other',
"I don't know anymore":'Other','Fear and Loathing':'Other'
}, inplace=True)
Code Check: Double check that you only have four unique values in the Q4: COUNTRY
column: USA
, Other
, CA
, and EU
# check unique values
candy['Q4: COUNTRY'].nunique()
4
Nite: we can also perform this with replace() as we did above, but we can also think of a way to do it in one line of code, than writing many in dictionary. Hint: You want to select everything where the Q4: COUNTRY
is not USA, Canada, Europe and code those as 'Other'.
## Alternative option:
# All other values will be grouped as 'other'
# So we can use the following code
# ~ means, negating(take the opposite)
# candy['Q4: COUNTRY'] != 'USA', 'CA', 'EU'
candy.loc[~candy['Q4: COUNTRY'].isin(['USA', 'CA','EU' ]), 'Q4: COUNTRY'] = 'Other'
Q6: To double check that everything was coded correctly, save the value counts of the Q4: COUNTRY
column as Q6
. You can check this once you run your CodeGrade check.
# Save the value counts of the Q4: COUNTRY column as Q6
Q6 = candy['Q4: COUNTRY'].value_counts()
Q6
USA 1955 CA 216 EU 73 Other 61 Name: Q4: COUNTRY, dtype: int64
candy['Q4: COUNTRY'].isnull().sum()
0
Q3: AGE
column. Let's look at all the unique values.¶# check unique age values for the age column
candy['Q3: AGE'].nunique()
95
candy['Q3: AGE'].unique()
array(['44', '49', '40', '23', nan, '53', '33', '43', '56', '64', '37', '48', '54', '36', '45', '25', '34', '35', '38', '58', '50', '47', '16', '52', '63', '65', '41', '27', '31', '59', '61', '46', '42', '62', '29', '39', '32', '28', '69', '67', '30', '22', '51', '70', '24', '19', 'Old enough', '57', '60', '66', '12', 'Many', '55', '72', '?', '21', '11', 'no', '9', '68', '20', '6', '10', '71', '13', '26', '45-55', '7', '39.4', '74', '18', 'older than dirt', '17', '15', '8', '75', '5u', 'Enough', 'Over 50', '90', '76', 'sixty-nine', 'ancient', '77', 'OLD', 'old', '73', '70 1/2', '14', 'MY NAME JEFF', '4', '59 on the day after Halloween', 'old enough', 'your mom', 'I can remember when Java was a cool new language', '60+'], dtype=object)
Again, this is a pretty messy column of data. This is a good example of why those that create online surveys shouldn't allow the individual to just put any value into the field. But it is now our job to clean this up.
Exercise_P: Your task is to put these values into the following categorical bins: unknown
, 17 and under
, 18-25
, 26-35
, 36-45
, 46-55
, and 56+
.
unknown
categoryunknown
category. For example: sixty-nine
should be coded as unknown
, 45-55
should be coded as unknown
, 59 on the day after Halloween
should be coded as unknown
, etc.Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')
First, we will replace any non-numeric value (those with text as mentioned above) with a missing value. This will allow you to turn the other values into floats so that you can bin them. Just don't forget to code the missing values as unknown
when you are done. To replace the non-numeric values, run the following code:
# create True/False index
age_index = candy['Q3: AGE'].str.isnumeric()
# for the index, fill missing values with False
age_index = age_index.fillna(False)
# select Age column for only those False values from index and code as missing
candy.loc[~age_index, 'Q3: AGE'] = np.nan
#Viewing missing values with False indeces
candy['Q3: AGE'].isnull()
1 False 2 False 3 False 4 False 5 True ... 2474 False 2475 False 2476 False 2477 False 2478 False Name: Q3: AGE, Length: 2305, dtype: bool
#Number of missing values in 'Q3: AGE' column
candy['Q3: AGE'].isnull().sum()
60
#Convert to float dtype
candy['Q3: AGE'] = candy['Q3: AGE'].astype(float)
candy['Q3: AGE']
1 44.0 2 49.0 3 40.0 4 23.0 5 NaN ... 2474 24.0 2475 33.0 2476 26.0 2477 58.0 2478 66.0 Name: Q3: AGE, Length: 2305, dtype: float64
##Your task is to put these values into the following categorical bins:
# unknown, 17 and under, 18-25, 26-35, 36-45, 46-55, and 56+
## We firt Create the following bins for the 'Q3: AGE' column:
# 17 and under, 18-25, 26-35, 36-45, 46-55, and 56+
candy.loc[:,'Q3: AGE'] = pd.cut(candy['Q3: AGE'], bins=[0, 17, 25, 35, 45, 55, 100],
labels=['17 and under','18-25', '26-35', '36-45', '46-55', '56+'])
#Check code
candy['Q3: AGE']
1 36-45 2 46-55 3 36-45 4 18-25 5 NaN ... 2474 18-25 2475 26-35 2476 26-35 2477 56+ 2478 56+ Name: Q3: AGE, Length: 2305, dtype: category Categories (6, object): ['17 and under' < '18-25' < '26-35' < '36-45' < '46-55' < '56+']
## add 'unknown' category in the above df(labels)
candy['Q3: AGE'] = candy['Q3: AGE'].cat.add_categories('unknown').cat.reorder_categories(['unknown', '17 and under','18-25', '26-35', '36-45', '46-55', '56+'])
# fill the NaN missing values with 'unknown'
candy.loc[:,'Q3: AGE'] = candy.loc[:,'Q3: AGE'].astype('category').fillna('unknown')
candy['Q3: AGE']
1 36-45 2 46-55 3 36-45 4 18-25 5 unknown ... 2474 18-25 2475 26-35 2476 26-35 2477 56+ 2478 56+ Name: Q3: AGE, Length: 2305, dtype: category Categories (7, object): ['unknown' < '17 and under' < '18-25' < '26-35' < '36-45' < '46-55' < '56+']
Exercise_Q: Double check yourself by checking the categories for the Q3: AGE
column. It should output: Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')
#checking the categories for the Q3: AGE column
bins = [0, 18, 25, 35, 45, 55, 100]
labels =['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+']
index = pd.DataFrame(candy['Q3: AGE'], labels, bins)
index.index.astype(object)
Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')
Code Check: To double check your above binning worked correctly, your value counts (sorted by the index) should be as follows:
unknown: 60
17 and under: 49
18-25: 85
26-35: 520
36-45: 768
46-55: 525
56+: 298
#Sort values for candy['Q3: AGE'] column, as above
candy['Q3: AGE'].value_counts().sort_index(ascending = True)
unknown 60 17 and under 49 18-25 85 26-35 520 36-45 768 46-55 525 56+ 298 Name: Q3: AGE, dtype: int64
You can also double check some of your work up to this point
by making sure that there are no missing values in the data set anymore.
#Checking missing value
candy['Q3: AGE'].isnull().sum()
0
Code Check: Check to see if there are any missing values in the data set. Your output should show 0
.
#Check to see if there are any missing values in the data set.
# The output should show 0
candy.isnull().sum()
Q1: GOING OUT? 0 Q2: GENDER 0 Q3: AGE 0 Q4: COUNTRY 0 Q6 | 100 Grand Bar 0 .. Q11: DAY 0 Q12: MEDIA [Daily Dish] 0 Q12: MEDIA [Science] 0 Q12: MEDIA [ESPN] 0 Q12: MEDIA [Yahoo] 0 Length: 113, dtype: int64
Exercise_R: Before you move on to the Data Analysis section, reset the index for candy
ensuring that it goes from 0 to n-1.
### Reset index for candy ensuring that it goes from 0 to n-1
# We apply the (start, stop, step) method
candy.index = pd.RangeIndex(start=0, stop=len(candy1), step=1)
candy
Q1: GOING OUT? | Q2: GENDER | Q3: AGE | Q4: COUNTRY | Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | Q6 | Bonkers (the candy) | Q6 | Bonkers (the board game) | ... | Q6 | Whatchamacallit Bars | Q6 | White Bread | Q6 | Whole Wheat anything | Q6 | York Peppermint Patties | Q10: DRESS | Q11: DAY | Q12: MEDIA [Daily Dish] | Q12: MEDIA [Science] | Q12: MEDIA [ESPN] | Q12: MEDIA [Yahoo] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | No | Male | 36-45 | USA | MEH | DESPAIR | JOY | MEH | DESPAIR | DESPAIR | ... | DESPAIR | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday | 0.0 | 1.0 | 0.0 | 0.0 |
1 | No | Male | 46-55 | USA | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | 0.0 | 0.0 | 0.0 | 0.0 |
2 | No | Male | 36-45 | USA | MEH | DESPAIR | JOY | MEH | MEH | DESPAIR | ... | JOY | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday | 0.0 | 1.0 | 0.0 | 0.0 |
3 | No | Male | 18-25 | USA | JOY | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | ... | JOY | DESPAIR | DESPAIR | JOY | White and gold | Friday | 0.0 | 1.0 | 0.0 | 0.0 |
4 | No | Male | unknown | Other | JOY | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | JOY | DESPAIR | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | 0.0 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2300 | No | Male | 18-25 | USA | JOY | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | DESPAIR | MEH | DESPAIR | MEH | White and gold | Friday | 0.0 | 0.0 | 0.0 | 0.0 |
2301 | No | Female | 26-35 | USA | MEH | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | JOY | DESPAIR | MEH | JOY | Blue and black | Friday | 0.0 | 1.0 | 0.0 | 0.0 |
2302 | No | Female | 26-35 | USA | MEH | DESPAIR | JOY | DESPAIR | MEH | JOY | ... | MEH | DESPAIR | DESPAIR | MEH | Blue and black | Friday | 0.0 | 1.0 | 0.0 | 0.0 |
2303 | No | Male | 56+ | USA | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | 0.0 | 0.0 | 0.0 | 0.0 |
2304 | No | Female | 56+ | USA | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | MEH | DESPAIR | JOY | White and gold | Sunday | 1.0 | 0.0 | 0.0 | 0.0 |
2305 rows × 113 columns
### Option 2:
#Reset index and retain old index as a column
# candy = candy.reset_index(drop=True)
# candy
I would suggest that you stop here and run your code through CodeGrade to check the previous steps before continuing. Just keep in mind that there are some preparation steps below that will be marked incorrect because you have not yet gotten to them.
Make sure that you are ready to answer any of the following questions about the data set that may appear on your quiz. Please use the cleaned, final candy
data to answer these questions. Note that the answers here may be different than any that appear in the article about this data set or that could be found using Excel. Ours has been altered and cleaned in a different way than the original authors did. Also, please do not use Excel to try to find these answers. First, you may not get the correct answer, and more importantly, we want you to practice your Pandas skills.
Exercise_S: How many rows were in the original, uncleaned data that you imported? How many rows are in the cleaned data? How many did we end up removing from the data set?
##There were 2479 rows in the original uncleaned data that we imported
# 2479 rows × 120 columns
candy_full.shape
(2479, 120)
##we end up with 2305 rows, removing/cleaning from the old data set
#2305 rows × 113 columns
candy.shape
(2305, 113)
# How many did we end up removing from the data set?
# Rows = 2479 rows - 2305 rows = 174 rows
pd.concat([candy_full, candy]).loc[candy_full.index.symmetric_difference(candy.index)]
Internal ID | Q1: GOING OUT? | Q2: GENDER | Q3: AGE | Q4: COUNTRY | Q5: STATE, PROVINCE, COUNTY, ETC | Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | ... | Q9: OTHER COMMENTS | Q10: DRESS | Unnamed: 113 | Q11: DAY | Q12: MEDIA [Daily Dish] | Q12: MEDIA [Science] | Q12: MEDIA [ESPN] | Q12: MEDIA [Yahoo] | Click Coordinates (x, y) | Q6_new | Hershey's Milk Chocolate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2305 | 90291228.0 | Yes | Female | 48 | USA | KS | JOY | MEH | JOY | DESPAIR | ... | Smarties (Commonwealth) are awesome. Sweetums ... | Blue and black | NaN | Friday | 1.0 | NaN | NaN | NaN | (31, 5) | NaN |
2306 | 90291230.0 | No | Female | 60 | United States | California | JOY | DESPAIR | JOY | MEH | ... | NaN | Blue and black | NaN | Friday | NaN | 1.0 | NaN | NaN | (58, 6) | NaN |
2307 | 90291318.0 | NaN | Male | 53 | United States | Illinois | MEH | DESPAIR | JOY | MEH | ... | NaN | White and gold | NaN | Friday | NaN | 1.0 | NaN | NaN | (80, 31) | NaN |
2308 | 90291319.0 | No | Male | 37 | us | IL | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2309 | 90291458.0 | Yes | Male | 42 | USA | OH | JOY | DESPAIR | JOY | MEH | ... | NaN | White and gold | NaN | Friday | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2474 | 90314359.0 | No | Male | 24 | USA | MD | JOY | DESPAIR | MEH | DESPAIR | ... | NaN | White and gold | NaN | Friday | NaN | NaN | NaN | NaN | NaN | NaN |
2475 | 90314580.0 | No | Female | 33 | USA | New York | MEH | DESPAIR | JOY | NaN | ... | NaN | Blue and black | NaN | Friday | NaN | 1.0 | NaN | NaN | (70, 26) | NaN |
2476 | 90314634.0 | No | Female | 26 | USA | Tennessee | MEH | DESPAIR | JOY | DESPAIR | ... | NaN | Blue and black | NaN | Friday | NaN | 1.0 | NaN | NaN | (67, 35) | NaN |
2477 | 90314658.0 | No | Male | 58 | Usa | North Carolina | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2478 | 90314802.0 | No | Female | 66 | usa | Pennsylvania | DESPAIR | DESPAIR | JOY | DESPAIR | ... | You hit all my chocolate highlights, and broug... | White and gold | NaN | Sunday | 1.0 | NaN | NaN | NaN | (19, 26) | NaN |
174 rows × 121 columns
Exercise_T: What percentage of respondents are planning to go out trick-or-treating? (Again, make sure that you are using the final, cleaned data for this and all the following questions.)
# number of respondents planning to go out
candy['Q1: GOING OUT?'].value_counts() #-- 298 number of respondents
No 2007 Yes 298 Name: Q1: GOING OUT?, dtype: int64
# percentages of respondents planning to go out
print(candy['Q1: GOING OUT?'].value_counts(normalize=True).sort_index(ascending = False))
print('======================================================')
print('About 12% or 13% respondents are planning to go out')
Yes 0.129284 No 0.870716 Name: Q1: GOING OUT?, dtype: float64 ====================================================== About 12% or 13% respondents are planning to go out
Exercise_U: What percentage of respondents 17 and younger are planning to go out for trick-or-treating?
#percentage of respondents 17 and younger
# Answer: 0.2 (2%)
print(candy['Q3: AGE'].value_counts(normalize=True).sort_index(ascending = True))
print('==========================================================')
print('About 2% respondents 17 and younger are planning to go out')
unknown 0.026030 17 and under 0.021258 18-25 0.036876 26-35 0.225597 36-45 0.333189 46-55 0.227766 56+ 0.129284 Name: Q3: AGE, dtype: float64 ========================================================== About 2% respondents 17 and younger are planning to go out
Exercise_V: What are the total value counts and the normalized percentages of the age categories from the age column?
#Total value counts and the normalized percentages of the age categories
print(candy['Q3: AGE'].value_counts().sum())
print('============================')
print(candy['Q3: AGE'].value_counts(normalize=True).sort_index(ascending = True))
2305 ============================ unknown 0.026030 17 and under 0.021258 18-25 0.036876 26-35 0.225597 36-45 0.333189 46-55 0.227766 56+ 0.129284 Name: Q3: AGE, dtype: float64
# We can print this for a list of column names
for col in candy.columns:
print(col)
Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) Q6 | Bottle Caps Q6 | Box'o'Raisins Q6 | Broken glow stick Q6 | Butterfinger Q6 | Cadbury Creme Eggs Q6 | Candy Corn Q6 | Candy that is clearly just the stuff given out for free at restaurants Q6 | Caramellos Q6 | Cash, or other forms of legal tender Q6 | Chardonnay Q6 | Chick-o-Sticks (we donÕt know what that is) Q6 | Chiclets Q6 | Coffee Crisp Q6 | Creepy Religious comics/Chick Tracts Q6 | Dental paraphenalia Q6 | Dots Q6 | Dove Bars Q6 | Fuzzy Peaches Q6 | Generic Brand Acetaminophen Q6 | Glow sticks Q6 | Goo Goo Clusters Q6 | Good N' Plenty Q6 | Gum from baseball cards Q6 | Gummy Bears straight up Q6 | Hard Candy Q6 | Healthy Fruit Q6 | Heath Bar Q6 | Hershey's Dark Chocolate Q6_new | Hershey's Milk Chocolate Q6 | Hershey's Kisses Q6 | Hugs (actual physical hugs) Q6 | Jolly Rancher (bad flavor) Q6 | Jolly Ranchers (good flavor) Q6 | JoyJoy (Mit Iodine!) Q6 | Junior Mints Q6 | Senior Mints Q6 | Kale smoothie Q6 | Kinder Happy Hippo Q6 | Kit Kat Q6 | LaffyTaffy Q6 | LemonHeads Q6 | Licorice (not black) Q6 | Licorice (yes black) Q6 | Lindt Truffle Q6 | Lollipops Q6 | Mars Q6 | Maynards Q6 | Mike and Ike Q6 | Milk Duds Q6 | Milky Way Q6 | Regular M&Ms Q6 | Peanut M&MÕs Q6 | Blue M&M's Q6 | Red M&M's Q6 | Green Party M&M's Q6 | Independent M&M's Q6 | Abstained from M&M'ing. Q6 | Minibags of chips Q6 | Mint Kisses Q6 | Mint Juleps Q6 | Mr. Goodbar Q6 | Necco Wafers Q6 | Nerds Q6 | Nestle Crunch Q6 | Now'n'Laters Q6 | Peeps Q6 | Pencils Q6 | Pixy Stix Q6 | Real Housewives of Orange County Season 9 Blue-Ray Q6 | ReeseÕs Peanut Butter Cups Q6 | Reese's Pieces Q6 | Reggie Jackson Bar Q6 | Rolos Q6 | Sandwich-sized bags filled with BooBerry Crunch Q6 | Skittles Q6 | Smarties (American) Q6 | Smarties (Commonwealth) Q6 | Snickers Q6 | Sourpatch Kids (i.e. abominations of nature) Q6 | Spotted Dick Q6 | Starburst Q6 | Sweet Tarts Q6 | Swedish Fish Q6 | Sweetums (a friend to diabetes) Q6 | Take 5 Q6 | Tic Tacs Q6 | Those odd marshmallow circus peanut things Q6 | Three Musketeers Q6 | Tolberone something or other Q6 | Trail Mix Q6 | Twix Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein Q6 | Vicodin Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties Q10: DRESS Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo]
Exercise_W: What are the total counts and percentages for the gender column?
#Total counts for the gender column
##Total percentages for the gender column
print(candy['Q2: GENDER'].value_counts().sum())
print('========================')
print(candy['Q2: GENDER'].value_counts(normalize=True).sort_index(ascending = True))
2305 ======================== Female 0.363991 Male 0.636009 Name: Q2: GENDER, dtype: float64
Exercise_X: What are the breakdown of counts for the country column?
# Breakdown of counts for the country column
candy['Q4: COUNTRY'].value_counts()
USA 1955 CA 216 EU 73 Other 61 Name: Q4: COUNTRY, dtype: int64
Exercise_Y: How many total respondents voted joy in candy corn and how many voted despair? Did more people vote joy or despair for candy corn?
#Total vote respondents for JOY and DESPAIR in candy corn
#Total percentages of vote respondents
print(candy['Q6 | Candy Corn'].value_counts())
print('=================================')
print(candy['Q6 | Candy Corn'].value_counts(normalize=True).round(2))
DESPAIR 702 NO_ANSWER 620 MEH 529 JOY 454 Name: Q6 | Candy Corn, dtype: int64 ================================= DESPAIR 0.30 NO_ANSWER 0.27 MEH 0.23 JOY 0.20 Name: Q6 | Candy Corn, dtype: float64
#Total percentages of vote respondents for JOY and DESPAIR in candy corn
print(candy['Q6 | Candy Corn'].value_counts(normalize=True))
print('=================================================')
print('More people voted for DESPAIR in candy corn = 30% ')
DESPAIR 0.304555 NO_ANSWER 0.268980 MEH 0.229501 JOY 0.196963 Name: Q6 | Candy Corn, dtype: float64 ================================================= More people voted for DESPAIR in candy corn = 30%
Exercise_Z: How many people voted joy in Reese's Peanut Butter Cups? In Snickers? Did more people vote joy for Reese's Peanut Butter Cups or for Snickers?
# candy['Q6 | Reese\'s Peanut Butter Cups'].value_counts() #check Reese\'s Peanut Butter Cups spelling
candy['Q6 | Snickers'].value_counts()
JOY 1325 NO_ANSWER 633 MEH 273 DESPAIR 74 Name: Q6 | Snickers, dtype: int64
Exercise_AA: How many people voted joy in Twix? In Kit Kats? Did more people vote joy for Twix or for Kit Kats?
candy['Q6 | Twix'].value_counts()
JOY 1339 NO_ANSWER 630 MEH 269 DESPAIR 67 Name: Q6 | Twix, dtype: int64
candy['Q6 | Kit Kat'].value_counts()
JOY 1367 NO_ANSWER 616 MEH 275 DESPAIR 47 Name: Q6 | Kit Kat, dtype: int64
Exercise_AB: How many people voted joy in white bread? For whole wheat items? Did more people vote joy for white bread or whole wheat items?
print(candy['Q6 | White Bread'].value_counts())
DESPAIR 1376 NO_ANSWER 694 MEH 192 JOY 43 Name: Q6 | White Bread, dtype: int64
print(candy['Q6 | Whole Wheat anything'].value_counts())
print('===============================================')
print('More people voted joy on whole wheat items')
DESPAIR 1233 NO_ANSWER 684 MEH 278 JOY 110 Name: Q6 | Whole Wheat anything, dtype: int64 =============================================== More people voted joy on whole wheat items
Exercise_AC: How many people voted joy for Bonkers the board game? For Bonkers the candy? Did more people vote joy for the board game or for the candy?
candy['Q6 | Bonkers (the board game)'].value_counts()
NO_ANSWER 926 MEH 673 DESPAIR 518 JOY 188 Name: Q6 | Bonkers (the board game), dtype: int64
print(candy['Q6 | Bonkers (the candy)'].value_counts())
print('===============================================')
print('More people voted JOY for the board game than the candy = 188')
NO_ANSWER 911 MEH 818 DESPAIR 467 JOY 109 Name: Q6 | Bonkers (the candy), dtype: int64 =============================================== More people voted JOY for the board game than the candy = 188
Exercise_AD: How many people voted joy for a box of raisins? For the Blue-Ray DVD of the Real Housewives of Orange County Season 9? Did more people vote joy for a box of raisins or for the DVD?
# candy['Box Of Raisins'].value_counts()
candy['Q6 | Candy Corn'].value_counts()
DESPAIR 702 NO_ANSWER 620 MEH 529 JOY 454 Name: Q6 | Candy Corn, dtype: int64
Exercise_AE: What is the favorite day of the week for the respondents (both by total counts and percentages)?
# favorite day of the week by total counts
print(candy['Q11: DAY'].value_counts())
Friday 1026 NO_ANSWER 658 Sunday 621 Name: Q11: DAY, dtype: int64
# favorite day of the week by percentage
print(candy['Q11: DAY'].value_counts(normalize=True).round(2))
print('===============================================')
print('Friday is the favorite day of the week for the respondents, both by total counts = 1026, and percentages = 45% )')
Friday 0.45 NO_ANSWER 0.29 Sunday 0.27 Name: Q11: DAY, dtype: float64 =============================================== Friday is the favorite day of the week for the respondents, both by total counts = 1026, and percentages = 45% )
Exercise_AF: Do more respondents see 'white and gold' or 'blue and black' for the color of the dress (both total counts and percentages)?
# color of the dress (total counts)
print(candy['Q10: DRESS'].value_counts())
White and gold 1027 NO_ANSWER 679 Blue and black 599 Name: Q10: DRESS, dtype: int64
# color of the dress (percentages)
print(candy['Q10: DRESS'].value_counts(normalize=True).round(2))
print('===========================================================================')
print('More respondents see White and gold for the color of the dress (1027 = 45%)')
White and gold 0.45 NO_ANSWER 0.29 Blue and black 0.26 Name: Q10: DRESS, dtype: float64 =========================================================================== More respondents see White and gold for the color of the dress (1027 = 45%)
Exercise_AG: For those respondents that clicked on the media link (listed as Q12
columns on the survey), which link did they click on the most?
print(candy['Q12: MEDIA [Yahoo]'].value_counts())
0.0 2244 1.0 61 Name: Q12: MEDIA [Yahoo], dtype: int64
Exercise_AH: How many males voted joy for receiving a copy of the Real Housewives of Orange County Season 9 DVD for Halloween? Females? Those 17 or younger?
# number of Males and Females voted joy
print(candy[['Q2: GENDER', 'Q6 | Real Housewives of Orange County Season 9 Blue-Ray']].value_counts())
print('====================================')
print('45 males and 41 Females voted joy')
Q2: GENDER Q6 | Real Housewives of Orange County Season 9 Blue-Ray Male DESPAIR 901 Female DESPAIR 497 Male NO_ANSWER 439 Female NO_ANSWER 253 Male MEH 81 Female MEH 48 Male JOY 45 Female JOY 41 dtype: int64 ==================================== 45 males and 41 Females voted joy
# number of respondents 17 or younger who voted joy
print(candy[['Q3: AGE','Q6 | Real Housewives of Orange County Season 9 Blue-Ray']].value_counts())
print('====================================')
print('2 respondents 17 or younger voted joy')
Q3: AGE Q6 | Real Housewives of Orange County Season 9 Blue-Ray 36-45 DESPAIR 482 46-55 DESPAIR 377 26-35 DESPAIR 273 36-45 NO_ANSWER 224 26-35 NO_ANSWER 187 56+ DESPAIR 177 46-55 NO_ANSWER 109 56+ NO_ANSWER 91 36-45 MEH 38 18-25 NO_ANSWER 35 DESPAIR 35 26-35 MEH 33 unknown NO_ANSWER 29 26-35 JOY 27 17 and under DESPAIR 27 unknown DESPAIR 27 46-55 MEH 26 36-45 JOY 24 56+ MEH 18 17 and under NO_ANSWER 17 46-55 JOY 13 56+ JOY 12 18-25 MEH 10 JOY 5 unknown JOY 3 17 and under MEH 3 JOY 2 unknown MEH 1 dtype: int64 ==================================== 2 respondents 17 or younger voted joy
Exercise_AI: The authors tried a funny way to determine a respondent's political leaning. Note this was purely a joke and not meant to be scientific. How many total respondents voted joy in the following: Blue M&M's, Red M&M's, Green Party M&M's, Independent M&M's, and Abstained from M&M'ing.
## number of total respondents voted joy in the following:
#Blue M&M's, Red M&M's, Green Party M&M's, Independent M&M's, and Abstained from M&M'ing
print(pd.DataFrame(candy, columns = ['Q6 | Blue M&M\'s']).value_counts())
print('==============================================')
print('963 total respondents voted joy in Blue M&M\'s')
print('==============================================')
print(pd.DataFrame(candy, columns = ['Q6 | Red M&M\'s']).value_counts())
print('==============================================')
print('949 total respondents voted joy in Red M&M\'s')
print('==============================================')
print(pd.DataFrame(candy, columns = ['Q6 | Green Party M&M\'s']).value_counts())
print('=====================================================')
print('915 total respondents voted joy in Green Party M&M\'s')
print('=====================================================')
print(pd.DataFrame(candy, columns = ['Q6 | Independent M&M\'s']).value_counts())
print('======================================================')
print('711 total respondents voted joy in Independent M&M\'s')
print('======================================================')
print(pd.DataFrame(candy, columns = ['Q6 | Abstained from M&M\'ing.']).value_counts())
print('==========================================================')
print('202 total respondents voted joy in Abstained from M&M\'ing.')
Q6 | Blue M&M's JOY 963 NO_ANSWER 667 MEH 563 DESPAIR 112 dtype: int64 ============================================== 963 total respondents voted joy in Blue M&M's ============================================== Q6 | Red M&M's JOY 949 NO_ANSWER 669 MEH 574 DESPAIR 113 dtype: int64 ============================================== 949 total respondents voted joy in Red M&M's ============================================== Q6 | Green Party M&M's JOY 915 NO_ANSWER 701 MEH 561 DESPAIR 128 dtype: int64 ===================================================== 915 total respondents voted joy in Green Party M&M's ===================================================== Q6 | Independent M&M's NO_ANSWER 749 JOY 711 MEH 592 DESPAIR 253 dtype: int64 ====================================================== 711 total respondents voted joy in Independent M&M's ====================================================== Q6 | Abstained from M&M'ing. NO_ANSWER 868 DESPAIR 662 MEH 573 JOY 202 dtype: int64 ========================================================== 202 total respondents voted joy in Abstained from M&M'ing.
# We can print this for a list of column names
for col in candy.columns:
print(col)
Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) Q6 | Bottle Caps Q6 | Box'o'Raisins Q6 | Broken glow stick Q6 | Butterfinger Q6 | Cadbury Creme Eggs Q6 | Candy Corn Q6 | Candy that is clearly just the stuff given out for free at restaurants Q6 | Caramellos Q6 | Cash, or other forms of legal tender Q6 | Chardonnay Q6 | Chick-o-Sticks (we donÕt know what that is) Q6 | Chiclets Q6 | Coffee Crisp Q6 | Creepy Religious comics/Chick Tracts Q6 | Dental paraphenalia Q6 | Dots Q6 | Dove Bars Q6 | Fuzzy Peaches Q6 | Generic Brand Acetaminophen Q6 | Glow sticks Q6 | Goo Goo Clusters Q6 | Good N' Plenty Q6 | Gum from baseball cards Q6 | Gummy Bears straight up Q6 | Hard Candy Q6 | Healthy Fruit Q6 | Heath Bar Q6 | Hershey's Dark Chocolate Q6_new | Hershey's Milk Chocolate Q6 | Hershey's Kisses Q6 | Hugs (actual physical hugs) Q6 | Jolly Rancher (bad flavor) Q6 | Jolly Ranchers (good flavor) Q6 | JoyJoy (Mit Iodine!) Q6 | Junior Mints Q6 | Senior Mints Q6 | Kale smoothie Q6 | Kinder Happy Hippo Q6 | Kit Kat Q6 | LaffyTaffy Q6 | LemonHeads Q6 | Licorice (not black) Q6 | Licorice (yes black) Q6 | Lindt Truffle Q6 | Lollipops Q6 | Mars Q6 | Maynards Q6 | Mike and Ike Q6 | Milk Duds Q6 | Milky Way Q6 | Regular M&Ms Q6 | Peanut M&MÕs Q6 | Blue M&M's Q6 | Red M&M's Q6 | Green Party M&M's Q6 | Independent M&M's Q6 | Abstained from M&M'ing. Q6 | Minibags of chips Q6 | Mint Kisses Q6 | Mint Juleps Q6 | Mr. Goodbar Q6 | Necco Wafers Q6 | Nerds Q6 | Nestle Crunch Q6 | Now'n'Laters Q6 | Peeps Q6 | Pencils Q6 | Pixy Stix Q6 | Real Housewives of Orange County Season 9 Blue-Ray Q6 | ReeseÕs Peanut Butter Cups Q6 | Reese's Pieces Q6 | Reggie Jackson Bar Q6 | Rolos Q6 | Sandwich-sized bags filled with BooBerry Crunch Q6 | Skittles Q6 | Smarties (American) Q6 | Smarties (Commonwealth) Q6 | Snickers Q6 | Sourpatch Kids (i.e. abominations of nature) Q6 | Spotted Dick Q6 | Starburst Q6 | Sweet Tarts Q6 | Swedish Fish Q6 | Sweetums (a friend to diabetes) Q6 | Take 5 Q6 | Tic Tacs Q6 | Those odd marshmallow circus peanut things Q6 | Three Musketeers Q6 | Tolberone something or other Q6 | Trail Mix Q6 | Twix Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein Q6 | Vicodin Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties Q10: DRESS Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo]
Exercise_AJ: Select only the Q6 candy columns (Q6 | 100 Grand Bar
through Q6 | York Peppermint Patties
) in the data set and save this as a new DataFrame called candy_reduced
.
#Select Q6 candy columns (Q6 | 100 Grand Bar through Q6 | York Peppermint Patties)
# And save this as a new DataFrame called candy_reduced
candy_reduced = candy.loc[:, 'Q6 | 100 Grand Bar':'Q6 | York Peppermint Patties']
candy_reduced
Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | Q6 | Bonkers (the candy) | Q6 | Bonkers (the board game) | Q6 | Bottle Caps | Q6 | Box'o'Raisins | Q6 | Broken glow stick | Q6 | Butterfinger | ... | Q6 | Three Musketeers | Q6 | Tolberone something or other | Q6 | Trail Mix | Q6 | Twix | Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein | Q6 | Vicodin | Q6 | Whatchamacallit Bars | Q6 | White Bread | Q6 | Whole Wheat anything | Q6 | York Peppermint Patties | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MEH | DESPAIR | JOY | MEH | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | ... | JOY | JOY | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR |
1 | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER |
2 | MEH | DESPAIR | JOY | MEH | MEH | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | DESPAIR | JOY | MEH | JOY | DESPAIR | JOY | JOY | DESPAIR | DESPAIR | DESPAIR |
3 | JOY | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | JOY | JOY | DESPAIR | JOY | MEH | JOY | JOY | DESPAIR | DESPAIR | JOY |
4 | JOY | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | MEH | MEH | DESPAIR | JOY | ... | JOY | JOY | MEH | JOY | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | JOY |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2300 | JOY | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | MEH | DESPAIR | DESPAIR | MEH | ... | MEH | MEH | JOY | JOY | MEH | JOY | DESPAIR | MEH | DESPAIR | MEH |
2301 | MEH | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | DESPAIR | DESPAIR | JOY | ... | MEH | MEH | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | JOY | DESPAIR | MEH | JOY |
2302 | MEH | DESPAIR | JOY | DESPAIR | MEH | JOY | DESPAIR | MEH | MEH | DESPAIR | ... | JOY | JOY | MEH | MEH | MEH | JOY | MEH | DESPAIR | DESPAIR | MEH |
2303 | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER |
2304 | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | DESPAIR | MEH | DESPAIR | JOY | ... | JOY | MEH | DESPAIR | DESPAIR | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | JOY |
2305 rows × 103 columns
Exercise_AK: Determine what candy/item from the candy_reduced
DataFrame has the most number of JOY votes and the least number of JOY votes. A simple way to do this is to filter the entire DataFrame for any JOY
values, then use count()
, then sort the values in descending order. See this stackoverflow question and answers.
candy_reduced[candy_reduced == 'JOY'].count().sort_values(ascending=False)
Q6 | Any full-sized candy bar 1477 Q6 | ReeseÕs Peanut Butter Cups 1416 Q6 | Kit Kat 1367 Q6 | Cash, or other forms of legal tender 1363 Q6 | Twix 1339 ... Q6 | JoyJoy (Mit Iodine!) 72 Q6 | Gum from baseball cards 43 Q6 | White Bread 43 Q6 | Candy that is clearly just the stuff given out for free at restaurants 37 Q6 | Broken glow stick 24 Length: 103, dtype: int64
Exercise_AL: Using the above as an example, what candy/item has the most DESPAIR votes?
candy_reduced[candy_reduced == 'DESPAIR'].count().sort_values(ascending=False)
Q6 | Broken glow stick 1535 Q6 | Real Housewives of Orange County Season 9 Blue-Ray 1398 Q6 | Gum from baseball cards 1386 Q6 | White Bread 1376 Q6 | Kale smoothie 1365 ... Q6 | Regular M&Ms 71 Q6 | Twix 67 Q6 | Cash, or other forms of legal tender 63 Q6 | Kit Kat 47 Q6 | Any full-sized candy bar 15 Length: 103, dtype: int64
Exercise_AM: What candy/item has the most MEH votes?
candy_reduced[candy_reduced == 'MEH'].count().sort_values(ascending=False)
Q6 | Lollipops 877 Q6 | Hard Candy 855 Q6 | Bonkers (the candy) 818 Q6 | Minibags of chips 718 Q6 | 100 Grand Bar 715 ... Q6 | ReeseÕs Peanut Butter Cups 178 Q6 | Kale smoothie 162 Q6 | Real Housewives of Orange County Season 9 Blue-Ray 129 Q6 | Broken glow stick 99 Q6 | Creepy Religious comics/Chick Tracts 95 Length: 103, dtype: int64
Exercise_AN: What candy/item did the most people not recognize or have an opinion on? (According to the survey, the respondents were asked to leave a question blank if they did not know the item)
candy_reduced[candy_reduced == 'NO_ANSWER'].count().sort_values(ascending=False)
Q6 | JoyJoy (Mit Iodine!) 942 Q6 | Maynards 939 Q6 | Reggie Jackson Bar 933 Q6 | Bonkers (the board game) 926 Q6 | Sweetums (a friend to diabetes) 924 ... Q6 | Kit Kat 616 Q6 | Any full-sized candy bar 615 Q6 | Hershey's Dark Chocolate 615 Q6_new | Hershey's Milk Chocolate 614 Q6 | Peanut M&MÕs 614 Length: 103, dtype: int64
In the final piece of the analysis, we will determine what candy/items have the most and lowest "net_feelies" (calculated by the authors as the total joy count minus the total despair count).
First, we will create two Series, one with JOY counts and one with DESPAIR counts to add to our candy_reduced
data.
Exercise_AO: Create a Series called joy_count
that lists total counts for JOY for each column, making sure to keep it in the same order as the columns in the candy_reduced
DataFrame. Hint: This should be almost exactly how we determined which candy/items had the most JOY votes, but we would not do any sorting.
joy_count = pd.Series(candy_reduced[candy_reduced == 'JOY'].count())
joy_count
# candy_reduced[candy_reduced == 'JOY'].count()
Q6 | 100 Grand Bar 828 Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) 165 Q6 | Any full-sized candy bar 1477 Q6 | Black Jacks 87 Q6 | Bonkers (the candy) 109 ... Q6 | Vicodin 674 Q6 | Whatchamacallit Bars 802 Q6 | White Bread 43 Q6 | Whole Wheat anything 110 Q6 | York Peppermint Patties 1040 Length: 103, dtype: int64
Exercise_AP: Same as above except you will create a Series called despair_count
that lists the total counts for DESPAIR for each column.
despair_count = pd.Series(candy_reduced[candy_reduced == 'DESPAIR'].count())
despair_count
Q6 | 100 Grand Bar 80 Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) 1017 Q6 | Any full-sized candy bar 15 Q6 | Black Jacks 754 Q6 | Bonkers (the candy) 467 ... Q6 | Vicodin 679 Q6 | Whatchamacallit Bars 273 Q6 | White Bread 1376 Q6 | Whole Wheat anything 1233 Q6 | York Peppermint Patties 222 Length: 103, dtype: int64
Exercise_AQ: Take the transpose of the candy_reduced
DataFrame and save this transposed data as candy_reduced_transpose
.
candy_reduced_transpose = candy_reduced.T
candy_reduced_transpose = candy_reduced_transpose.reset_index()
candy_reduced_transpose
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 2295 | 2296 | 2297 | 2298 | 2299 | 2300 | 2301 | 2302 | 2303 | 2304 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q6 | 100 Grand Bar | MEH | NO_ANSWER | MEH | JOY | JOY | NO_ANSWER | JOY | JOY | MEH | ... | NO_ANSWER | NO_ANSWER | JOY | MEH | JOY | JOY | MEH | MEH | NO_ANSWER | DESPAIR |
1 | Q6 | Anonymous brown globs that come in black ... | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | ... | NO_ANSWER | NO_ANSWER | DESPAIR | DESPAIR | MEH | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR |
2 | Q6 | Any full-sized candy bar | JOY | NO_ANSWER | JOY | JOY | JOY | NO_ANSWER | JOY | JOY | JOY | ... | NO_ANSWER | NO_ANSWER | JOY | JOY | JOY | MEH | JOY | JOY | NO_ANSWER | JOY |
3 | Q6 | Black Jacks | MEH | NO_ANSWER | MEH | DESPAIR | NO_ANSWER | NO_ANSWER | DESPAIR | MEH | MEH | ... | NO_ANSWER | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | NO_ANSWER | DESPAIR |
4 | Q6 | Bonkers (the candy) | DESPAIR | NO_ANSWER | MEH | MEH | NO_ANSWER | NO_ANSWER | DESPAIR | MEH | MEH | ... | NO_ANSWER | NO_ANSWER | MEH | MEH | MEH | DESPAIR | NO_ANSWER | MEH | NO_ANSWER | DESPAIR |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
98 | Q6 | Vicodin | DESPAIR | NO_ANSWER | JOY | JOY | DESPAIR | NO_ANSWER | MEH | DESPAIR | DESPAIR | ... | NO_ANSWER | NO_ANSWER | JOY | DESPAIR | MEH | JOY | NO_ANSWER | JOY | NO_ANSWER | JOY |
99 | Q6 | Whatchamacallit Bars | DESPAIR | NO_ANSWER | JOY | JOY | JOY | NO_ANSWER | DESPAIR | MEH | MEH | ... | NO_ANSWER | NO_ANSWER | MEH | NO_ANSWER | JOY | DESPAIR | JOY | MEH | NO_ANSWER | DESPAIR |
100 | Q6 | White Bread | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | ... | NO_ANSWER | NO_ANSWER | DESPAIR | DESPAIR | MEH | MEH | DESPAIR | DESPAIR | NO_ANSWER | MEH |
101 | Q6 | Whole Wheat anything | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | ... | NO_ANSWER | NO_ANSWER | MEH | DESPAIR | MEH | DESPAIR | MEH | DESPAIR | NO_ANSWER | DESPAIR |
102 | Q6 | York Peppermint Patties | DESPAIR | NO_ANSWER | DESPAIR | JOY | JOY | NO_ANSWER | DESPAIR | DESPAIR | MEH | ... | NO_ANSWER | NO_ANSWER | MEH | JOY | JOY | MEH | JOY | MEH | NO_ANSWER | JOY |
103 rows × 2306 columns
Exercise_AR: Add a new column called "joy_count" using the joy_count
Series above and a new column called 'despair_count" using the despair_count
Series above to the candy_reduced_transpose
DataFrame.
## Add a new column called "joy_count" using the `joy_count` Series above to the candy_reduced_transpose DataFrame
candy_reduced_transpose['joy_count'] = candy_reduced_transpose['index'].map(joy_count)
candy_reduced_transpose
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 2296 | 2297 | 2298 | 2299 | 2300 | 2301 | 2302 | 2303 | 2304 | joy_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q6 | 100 Grand Bar | MEH | NO_ANSWER | MEH | JOY | JOY | NO_ANSWER | JOY | JOY | MEH | ... | NO_ANSWER | JOY | MEH | JOY | JOY | MEH | MEH | NO_ANSWER | DESPAIR | 828 |
1 | Q6 | Anonymous brown globs that come in black ... | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | ... | NO_ANSWER | DESPAIR | DESPAIR | MEH | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | 165 |
2 | Q6 | Any full-sized candy bar | JOY | NO_ANSWER | JOY | JOY | JOY | NO_ANSWER | JOY | JOY | JOY | ... | NO_ANSWER | JOY | JOY | JOY | MEH | JOY | JOY | NO_ANSWER | JOY | 1477 |
3 | Q6 | Black Jacks | MEH | NO_ANSWER | MEH | DESPAIR | NO_ANSWER | NO_ANSWER | DESPAIR | MEH | MEH | ... | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | NO_ANSWER | DESPAIR | 87 |
4 | Q6 | Bonkers (the candy) | DESPAIR | NO_ANSWER | MEH | MEH | NO_ANSWER | NO_ANSWER | DESPAIR | MEH | MEH | ... | NO_ANSWER | MEH | MEH | MEH | DESPAIR | NO_ANSWER | MEH | NO_ANSWER | DESPAIR | 109 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
98 | Q6 | Vicodin | DESPAIR | NO_ANSWER | JOY | JOY | DESPAIR | NO_ANSWER | MEH | DESPAIR | DESPAIR | ... | NO_ANSWER | JOY | DESPAIR | MEH | JOY | NO_ANSWER | JOY | NO_ANSWER | JOY | 674 |
99 | Q6 | Whatchamacallit Bars | DESPAIR | NO_ANSWER | JOY | JOY | JOY | NO_ANSWER | DESPAIR | MEH | MEH | ... | NO_ANSWER | MEH | NO_ANSWER | JOY | DESPAIR | JOY | MEH | NO_ANSWER | DESPAIR | 802 |
100 | Q6 | White Bread | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | ... | NO_ANSWER | DESPAIR | DESPAIR | MEH | MEH | DESPAIR | DESPAIR | NO_ANSWER | MEH | 43 |
101 | Q6 | Whole Wheat anything | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | ... | NO_ANSWER | MEH | DESPAIR | MEH | DESPAIR | MEH | DESPAIR | NO_ANSWER | DESPAIR | 110 |
102 | Q6 | York Peppermint Patties | DESPAIR | NO_ANSWER | DESPAIR | JOY | JOY | NO_ANSWER | DESPAIR | DESPAIR | MEH | ... | NO_ANSWER | MEH | JOY | JOY | MEH | JOY | MEH | NO_ANSWER | JOY | 1040 |
103 rows × 2307 columns
#check joy_count sum
joy_count.sum()
60771
## Add a new column called 'despair_count" using the despair_count Series above to the candy_reduced_transpose DataFrame.
candy_reduced_transpose['despair_count'] = candy_reduced_transpose['index'].map(despair_count)
candy_reduced_transpose
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 2297 | 2298 | 2299 | 2300 | 2301 | 2302 | 2303 | 2304 | joy_count | despair_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q6 | 100 Grand Bar | MEH | NO_ANSWER | MEH | JOY | JOY | NO_ANSWER | JOY | JOY | MEH | ... | JOY | MEH | JOY | JOY | MEH | MEH | NO_ANSWER | DESPAIR | 828 | 80 |
1 | Q6 | Anonymous brown globs that come in black ... | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | DESPAIR | MEH | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | 165 | 1017 |
2 | Q6 | Any full-sized candy bar | JOY | NO_ANSWER | JOY | JOY | JOY | NO_ANSWER | JOY | JOY | JOY | ... | JOY | JOY | JOY | MEH | JOY | JOY | NO_ANSWER | JOY | 1477 | 15 |
3 | Q6 | Black Jacks | MEH | NO_ANSWER | MEH | DESPAIR | NO_ANSWER | NO_ANSWER | DESPAIR | MEH | MEH | ... | DESPAIR | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | NO_ANSWER | DESPAIR | 87 | 754 |
4 | Q6 | Bonkers (the candy) | DESPAIR | NO_ANSWER | MEH | MEH | NO_ANSWER | NO_ANSWER | DESPAIR | MEH | MEH | ... | MEH | MEH | MEH | DESPAIR | NO_ANSWER | MEH | NO_ANSWER | DESPAIR | 109 | 467 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
98 | Q6 | Vicodin | DESPAIR | NO_ANSWER | JOY | JOY | DESPAIR | NO_ANSWER | MEH | DESPAIR | DESPAIR | ... | JOY | DESPAIR | MEH | JOY | NO_ANSWER | JOY | NO_ANSWER | JOY | 674 | 679 |
99 | Q6 | Whatchamacallit Bars | DESPAIR | NO_ANSWER | JOY | JOY | JOY | NO_ANSWER | DESPAIR | MEH | MEH | ... | MEH | NO_ANSWER | JOY | DESPAIR | JOY | MEH | NO_ANSWER | DESPAIR | 802 | 273 |
100 | Q6 | White Bread | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | DESPAIR | MEH | MEH | DESPAIR | DESPAIR | NO_ANSWER | MEH | 43 | 1376 |
101 | Q6 | Whole Wheat anything | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | ... | MEH | DESPAIR | MEH | DESPAIR | MEH | DESPAIR | NO_ANSWER | DESPAIR | 110 | 1233 |
102 | Q6 | York Peppermint Patties | DESPAIR | NO_ANSWER | DESPAIR | JOY | JOY | NO_ANSWER | DESPAIR | DESPAIR | MEH | ... | MEH | JOY | JOY | MEH | JOY | MEH | NO_ANSWER | JOY | 1040 | 222 |
103 rows × 2308 columns
#check despair_count sum
despair_count.sum()
55169
Exercise_AS: Add a new column to the candy_reduced_transpose
DataFrame called "net_feelies" that takes the joy_count
column and subtracts the despair_count
column.
# Add a new column to the candy_reduced_transpose DataFrame called "net_feelies"
# that takes the joy_count column and subtracts the despair_count column
candy_reduced_transpose['net_feelies'] = candy_reduced_transpose['joy_count']-candy_reduced_transpose['despair_count'] ### ENTER CODE HERE ###
candy_reduced_transpose
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 2298 | 2299 | 2300 | 2301 | 2302 | 2303 | 2304 | joy_count | despair_count | net_feelies | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q6 | 100 Grand Bar | MEH | NO_ANSWER | MEH | JOY | JOY | NO_ANSWER | JOY | JOY | MEH | ... | MEH | JOY | JOY | MEH | MEH | NO_ANSWER | DESPAIR | 828 | 80 | 748 |
1 | Q6 | Anonymous brown globs that come in black ... | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | MEH | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | 165 | 1017 | -852 |
2 | Q6 | Any full-sized candy bar | JOY | NO_ANSWER | JOY | JOY | JOY | NO_ANSWER | JOY | JOY | JOY | ... | JOY | JOY | MEH | JOY | JOY | NO_ANSWER | JOY | 1477 | 15 | 1462 |
3 | Q6 | Black Jacks | MEH | NO_ANSWER | MEH | DESPAIR | NO_ANSWER | NO_ANSWER | DESPAIR | MEH | MEH | ... | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | NO_ANSWER | DESPAIR | 87 | 754 | -667 |
4 | Q6 | Bonkers (the candy) | DESPAIR | NO_ANSWER | MEH | MEH | NO_ANSWER | NO_ANSWER | DESPAIR | MEH | MEH | ... | MEH | MEH | DESPAIR | NO_ANSWER | MEH | NO_ANSWER | DESPAIR | 109 | 467 | -358 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
98 | Q6 | Vicodin | DESPAIR | NO_ANSWER | JOY | JOY | DESPAIR | NO_ANSWER | MEH | DESPAIR | DESPAIR | ... | DESPAIR | MEH | JOY | NO_ANSWER | JOY | NO_ANSWER | JOY | 674 | 679 | -5 |
99 | Q6 | Whatchamacallit Bars | DESPAIR | NO_ANSWER | JOY | JOY | JOY | NO_ANSWER | DESPAIR | MEH | MEH | ... | NO_ANSWER | JOY | DESPAIR | JOY | MEH | NO_ANSWER | DESPAIR | 802 | 273 | 529 |
100 | Q6 | White Bread | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | MEH | MEH | DESPAIR | DESPAIR | NO_ANSWER | MEH | 43 | 1376 | -1333 |
101 | Q6 | Whole Wheat anything | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | NO_ANSWER | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | MEH | DESPAIR | MEH | DESPAIR | NO_ANSWER | DESPAIR | 110 | 1233 | -1123 |
102 | Q6 | York Peppermint Patties | DESPAIR | NO_ANSWER | DESPAIR | JOY | JOY | NO_ANSWER | DESPAIR | DESPAIR | MEH | ... | JOY | JOY | MEH | JOY | MEH | NO_ANSWER | JOY | 1040 | 222 | 818 |
103 rows × 2309 columns
## net_feelies sum check
# np.random.seed(56)
# count = np.random.uniform(low = -57, high = 9, size = (len(candy_reduced_transpose))).round(0)
# net_feelies = count
# net_feelies.sum()
Exercise_AT: Select only the joy_count
, despair_count
, and net_feelies
columns from the candy_reduced_transpose
DataFrame. Sort this DataFrame in descending order by net_feelies
and save this as candy_net_sorted
.
### This needs to be corrected! ###
candy_net_sorted = candy_reduced_transpose[['joy_count', 'despair_count', 'net_feelies']]#.sort_values(by='net_feelies', ascending=False)
candy_net_sorted.sort_values(by='net_feelies', ascending=False)
joy_count | despair_count | net_feelies | |
---|---|---|---|
2 | 1477 | 15 | 1462 |
75 | 1416 | 88 | 1328 |
44 | 1367 | 47 | 1320 |
14 | 1363 | 63 | 1300 |
96 | 1339 | 67 | 1272 |
... | ... | ... | ... |
20 | 84 | 1356 | -1272 |
74 | 86 | 1398 | -1312 |
100 | 43 | 1376 | -1333 |
28 | 43 | 1386 | -1343 |
8 | 24 | 1535 | -1511 |
103 rows × 3 columns
###### net_feelies sum check ######
## This need correction. Answer should be 5602, not -2523 ##
candy_net_sorted['net_feelies'].sum()
5602
Be prepared to answer what candy/item had the most and least net_feelies
values.
We now want to get the candy
DataFrame ready to run a machine learning algorthim to determine if we could predict a person's gender based on what candy they prefer.
You will learn more about this in the machine learning classes, but some algorithms work exclusively with numeric values. We will now turn all of our values into numeric values. There are easier ways to do this with sklearn, which you will study in later courses, but we will use Pandas to perform these exercises for further practice.
Exercise_AU: For grading purposes, we want to leave the candy
DataFrame as is. Make a copy of the candy
DataFrame and save this new DataFrame as candy_encode
.
candy_encode = candy.copy()
candy_encode.shape
(2305, 113)
Exercise_AV: For the candy_encode
DataFrame, replace any Female
values with 0
and any Male
values with 1
.
# For the candy_encode DataFrame, replace any Female values with 0 and any Male values with 1
candy_encode['Q2: GENDER'].replace(['Female','Male'],[0,1], inplace=True)
candy_encode
Q1: GOING OUT? | Q2: GENDER | Q3: AGE | Q4: COUNTRY | Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | Q6 | Bonkers (the candy) | Q6 | Bonkers (the board game) | ... | Q6 | Whatchamacallit Bars | Q6 | White Bread | Q6 | Whole Wheat anything | Q6 | York Peppermint Patties | Q10: DRESS | Q11: DAY | Q12: MEDIA [Daily Dish] | Q12: MEDIA [Science] | Q12: MEDIA [ESPN] | Q12: MEDIA [Yahoo] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | No | 1 | 36-45 | USA | MEH | DESPAIR | JOY | MEH | DESPAIR | DESPAIR | ... | DESPAIR | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday | 0.0 | 1.0 | 0.0 | 0.0 |
1 | No | 1 | 46-55 | USA | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | 0.0 | 0.0 | 0.0 | 0.0 |
2 | No | 1 | 36-45 | USA | MEH | DESPAIR | JOY | MEH | MEH | DESPAIR | ... | JOY | DESPAIR | DESPAIR | DESPAIR | White and gold | Sunday | 0.0 | 1.0 | 0.0 | 0.0 |
3 | No | 1 | 18-25 | USA | JOY | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | ... | JOY | DESPAIR | DESPAIR | JOY | White and gold | Friday | 0.0 | 1.0 | 0.0 | 0.0 |
4 | No | 1 | unknown | Other | JOY | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | JOY | DESPAIR | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | 0.0 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2300 | No | 1 | 18-25 | USA | JOY | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | DESPAIR | MEH | DESPAIR | MEH | White and gold | Friday | 0.0 | 0.0 | 0.0 | 0.0 |
2301 | No | 0 | 26-35 | USA | MEH | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | JOY | DESPAIR | MEH | JOY | Blue and black | Friday | 0.0 | 1.0 | 0.0 | 0.0 |
2302 | No | 0 | 26-35 | USA | MEH | DESPAIR | JOY | DESPAIR | MEH | JOY | ... | MEH | DESPAIR | DESPAIR | MEH | Blue and black | Friday | 0.0 | 1.0 | 0.0 | 0.0 |
2303 | No | 1 | 56+ | USA | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | 0.0 | 0.0 | 0.0 | 0.0 |
2304 | No | 0 | 56+ | USA | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | ... | DESPAIR | MEH | DESPAIR | JOY | White and gold | Sunday | 1.0 | 0.0 | 0.0 | 0.0 |
2305 rows × 113 columns
Exercise_AW: Again, you will learn more about this later, but we need to separate the column that we want to predict (called the response) and the columns that we will use to make the predictions (called the features). For both of the items below, make sure that the index is reset and goes from 0 to n-1.
Q2: GENDER
column from candy_encode
and save this as candy_response
. Note: This should be a Series.candy_encode
DataFrame: Q2: GENDER
,Q1: GOING OUT?
,Q3: AGE
,Q4: COUNTRY
,Q10: DRESS
,Q11: DAY
, Q12: MEDIA [Daily Dish]
,Q12: MEDIA [Science]
,Q12: MEDIA [ESPN]
,Q12: MEDIA [Yahoo]
. Save the remaining columns as candy_features
.# Select only the Q2: GENDER column from candy_encode and save this as candy_response(This should be a Series)
candy_response = candy_encode['Q2: GENDER']
candy_response
0 1 1 1 2 1 3 1 4 1 .. 2300 1 2301 0 2302 0 2303 1 2304 0 Name: Q2: GENDER, Length: 2305, dtype: int64
### Drop the following columns from the candy_encode DataFrame:
# Q1: GOING OUT?,Q2: GENDER, Q3: AGE,Q4: COUNTRY,Q10: DRESS,Q11: DAY, Q12: MEDIA [Daily Dish],
# Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo].
# Save the remaining columns as candy_features
candy_features = candy_encode.drop(columns = ['Q1: GOING OUT?','Q2: GENDER', 'Q3: AGE', 'Q4: COUNTRY',
'Q10: DRESS', 'Q11: DAY', 'Q12: MEDIA [Daily Dish]',
'Q12: MEDIA [Science]', 'Q12: MEDIA [ESPN]',
'Q12: MEDIA [Yahoo]'], axis =1)
candy_features
Q6 | 100 Grand Bar | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) | Q6 | Any full-sized candy bar | Q6 | Black Jacks | Q6 | Bonkers (the candy) | Q6 | Bonkers (the board game) | Q6 | Bottle Caps | Q6 | Box'o'Raisins | Q6 | Broken glow stick | Q6 | Butterfinger | ... | Q6 | Three Musketeers | Q6 | Tolberone something or other | Q6 | Trail Mix | Q6 | Twix | Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein | Q6 | Vicodin | Q6 | Whatchamacallit Bars | Q6 | White Bread | Q6 | Whole Wheat anything | Q6 | York Peppermint Patties | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MEH | DESPAIR | JOY | MEH | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | ... | JOY | JOY | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR | DESPAIR |
1 | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER |
2 | MEH | DESPAIR | JOY | MEH | MEH | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | DESPAIR | JOY | MEH | JOY | DESPAIR | JOY | JOY | DESPAIR | DESPAIR | DESPAIR |
3 | JOY | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | ... | JOY | JOY | DESPAIR | JOY | MEH | JOY | JOY | DESPAIR | DESPAIR | JOY |
4 | JOY | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | MEH | MEH | DESPAIR | JOY | ... | JOY | JOY | MEH | JOY | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | JOY |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2300 | JOY | DESPAIR | MEH | DESPAIR | DESPAIR | MEH | MEH | DESPAIR | DESPAIR | MEH | ... | MEH | MEH | JOY | JOY | MEH | JOY | DESPAIR | MEH | DESPAIR | MEH |
2301 | MEH | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | DESPAIR | DESPAIR | JOY | ... | MEH | MEH | DESPAIR | JOY | NO_ANSWER | NO_ANSWER | JOY | DESPAIR | MEH | JOY |
2302 | MEH | DESPAIR | JOY | DESPAIR | MEH | JOY | DESPAIR | MEH | MEH | DESPAIR | ... | JOY | JOY | MEH | MEH | MEH | JOY | MEH | DESPAIR | DESPAIR | MEH |
2303 | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | ... | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER | NO_ANSWER |
2304 | DESPAIR | DESPAIR | JOY | DESPAIR | DESPAIR | DESPAIR | DESPAIR | MEH | DESPAIR | JOY | ... | JOY | MEH | DESPAIR | DESPAIR | DESPAIR | JOY | DESPAIR | MEH | DESPAIR | JOY |
2305 rows × 103 columns
Exercise_AX: Use Panda's get_dummies()
to encode the candy_features
data, making sure to set drop_first=True
. Save this as candy_features_encoded
.
# Use Panda's get_dummies() to encode the candy_features data, making sure to set drop_first=True.
# Save this as candy_features_encoded
candy_features_encoded = pd.get_dummies((candy_features), drop_first=True)
candy_features_encoded
Q6 | 100 Grand Bar_JOY | Q6 | 100 Grand Bar_MEH | Q6 | 100 Grand Bar_NO_ANSWER | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)_JOY | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)_MEH | Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)_NO_ANSWER | Q6 | Any full-sized candy bar_JOY | Q6 | Any full-sized candy bar_MEH | Q6 | Any full-sized candy bar_NO_ANSWER | Q6 | Black Jacks_JOY | ... | Q6 | Whatchamacallit Bars_NO_ANSWER | Q6 | White Bread_JOY | Q6 | White Bread_MEH | Q6 | White Bread_NO_ANSWER | Q6 | Whole Wheat anything_JOY | Q6 | Whole Wheat anything_MEH | Q6 | Whole Wheat anything_NO_ANSWER | Q6 | York Peppermint Patties_JOY | Q6 | York Peppermint Patties_MEH | Q6 | York Peppermint Patties_NO_ANSWER | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2300 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2301 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
2302 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2303 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
2304 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2305 rows × 309 columns
Code Check: Make sure that the candy
, candy_features
, candy_response
and candy_features_encoded
has an index that goes from 0 to n-1 or your final CodeGrade tests will not pass.
Great work! You have now cleaned your data and prepared it to be passed to a machine learning model.
I created models using Random Forest, Logistic Regression, and XGBoost algorithms, and they all returned around 70% accuracy rates. However, the other accuracy metrics (that you will learn about more in the machine learning classes) didn't look as good. Given the metrics that were calculated, I would say that based only on this data, using candy preference is not that great of an indicator of someone's gender.
Next Steps: Make sure that your notebook passes all the CodeGrade tests and then use this notebook to answer questions in the corresponding quiz in Brightspace.